
Big Data: A Survey

Min Chen · Shiwen Mao · Yunhao Liu

© Springer Science+Business Media New York 2014

Mobile Netw Appl (2014) 19:171–209. DOI 10.1007/s11036-013-0489-0

Abstract In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, Internet of Things, data centers, and Hadoop. We then focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine several representative applications of big data, including enterprise management, Internet of Things, online social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers. This survey is concluded with a discussion of open problems and future directions.

Keywords Big data · Cloud computing · Internet of things · Data center · Hadoop · Smart grid · Big data analysis

M. Chen (✉)
School of Computer Science and Technology, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan, 430074, China
e-mail: minchen2012@hust.edu.cn; minchen@ieee.org

S. Mao
Department of Electrical & Computer Engineering, Auburn University, 200 Broun Hall, Auburn, AL 36849-5201, USA
e-mail: smao@ieee.org

Y. Liu
TNLIST, School of Software, Tsinghua University, Beijing, China
e-mail: yunhao@greenorbs.com

1 Background

1.1 Dawn of big data era

Over the past 20 years, data has increased on a large scale in various fields. According to a report from International Data Corporation (IDC), in 2011, the overall created and copied data volume in the world was 1.8 ZB (≈ 10²¹ B), which increased by nearly nine times within five years [1]. This figure will double at least every other two years in the near future.

Under the explosive increase of global data, the term big data is mainly used to describe enormous datasets. Compared with traditional datasets, big data typically includes masses of unstructured data that need more real-time analysis. In addition, big data also brings about new opportunities for discovering new values, helps us to gain an in-depth understanding of the hidden values, and also incurs new challenges, e.g., how to effectively organize and manage such datasets.

Recently, industries have become interested in the high potential of big data, and many government agencies have announced major plans to accelerate big data research and applications [2]. In addition, issues on big data are often covered in public media, such as The Economist [3, 4], New York Times [5], and National Public Radio [6, 7]. Two premier scientific journals, Nature and Science, also opened special columns to discuss the challenges and impacts of big data [8, 9]. The era of big data has come beyond all doubt [10].

Nowadays, big data related to the services of Internet companies grows rapidly. For example, Google processes data of hundreds of petabytes (PB), Facebook generates log data of over 10 PB per month, Baidu, a Chinese company, processes data of tens of PB, and Taobao, a subsidiary of Alibaba, generates data of tens of terabytes (TB) for online trading per day. Figure 1 illustrates the boom of the global data volume. While the amount of large datasets is drastically rising, it also brings about many challenging problems demanding prompt solutions:

– The latest advances of information technology (IT) make it easier to generate data. For example, on average, 72 hours of videos are uploaded to YouTube every minute [11]. Therefore, we are confronted with the main challenge of collecting and integrating massive data from widely distributed data sources.

– The rapid growth of cloud computing and the Internet of Things (IoT) further promotes the sharp growth of data. Cloud computing provides safeguarding, access sites, and channels for data assets. In the paradigm of IoT, sensors all over the world are collecting and transmitting data to be stored and processed in the cloud. Such data in both quantity and mutual relations will far surpass the capacities of the IT architectures and infrastructure of existing enterprises, and its real-time requirement will also greatly stress the available computing capacity. The increasingly growing data cause a problem of how to store and manage such huge heterogeneous datasets with moderate requirements on hardware and software infrastructure.

– In consideration of the heterogeneity, scalability, real-time nature, complexity, and privacy of big data, we shall effectively "mine" the datasets at different levels during the analysis, modeling, visualization, and forecasting, so as to reveal their intrinsic properties and improve decision making.

1.2 Definition and features of big data

Big data is an abstract concept. Apart from masses of data, it also has some other features, which determine the difference between itself and "massive data" or "very big data."

Fig. 1 The continuously increasing big data


At present, although the importance of big data has been generally recognized, people still have different opinions on its definition. In general, big data shall mean the datasets that could not be perceived, acquired, managed, and processed by traditional IT and software/hardware tools within a tolerable time. Because of different concerns, scientific and technological enterprises, research scholars, data analysts, and technical practitioners have different definitions of big data. The following definitions may help us have a better understanding of the profound social, economic, and technological connotations of big data.

In 2010, Apache Hadoop defined big data as "datasets which could not be captured, managed, and processed by general computers within an acceptable scope." On the basis of this definition, in May 2011, McKinsey & Company, a global consulting agency, announced Big Data as the next frontier for innovation, competition, and productivity. Big data shall mean such datasets which could not be acquired, stored, and managed by classic database software. This definition includes two connotations: First, datasets' volumes that conform to the standard of big data are changing, and may grow over time or with technological advances; Second, datasets' volumes that conform to the standard of big data in different applications differ from each other. At present, big data generally ranges from several TB to several PB [10]. From the definition by McKinsey & Company, it can be seen that the volume of a dataset is not the only criterion for big data. The increasingly growing data scale and its management that could not be handled by traditional database technologies are the next two key features.

As a matter of fact, big data has been defined as early as 2001. Doug Laney, an analyst of META (presently Gartner), defined challenges and opportunities brought about by increased data with a 3Vs model, i.e., the increase of Volume, Velocity, and Variety, in a research report [12]. Although such a model was not originally used to define big data, Gartner and many other enterprises, including IBM [13] and some research departments of Microsoft [14], still used the "3Vs" model to describe big data within the following ten years [15]. In the "3Vs" model, Volume means that, with the generation and collection of masses of data, data scale becomes increasingly big; Velocity means the timeliness of big data, specifically, that data collection and analysis, etc. must be rapidly and timely conducted, so as to maximize the commercial value of big data; Variety indicates the various types of data, which include semi-structured and unstructured data such as audio, video, webpage, and text, as well as traditional structured data.

However, others have different opinions, including IDC, one of the most influential leaders in big data and its research fields. In 2011, an IDC report defined big data as "big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis" [1]. With this definition, characteristics of big data may be summarized as four Vs, i.e., Volume (great volume), Variety (various modalities), Velocity (rapid generation), and Value (huge value but very low density), as shown in Fig. 2. Such a 4Vs definition was widely recognized since it highlights the meaning and necessity of big data, i.e., exploring the huge hidden values. This definition indicates the most critical problem in big data, which is how to discover values from datasets with an enormous scale, various types, and rapid generation. As Jay Parikh, Deputy Chief Engineer of Facebook, said, "You could only own a bunch of data other than big data if you do not utilize the collected data" [11].

In addition, NIST defines big data as "Big data shall mean the data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis, or the data which may be effectively processed with important horizontal zoom technologies," which focuses on the technological aspect of big data. It indicates that efficient methods or technologies need to be developed and used to analyze and process big data.

There have been considerable discussions from both industry and academia on the definition of big data [16, 17]. In addition to developing a proper definition, big data research should also focus on how to extract its value, how to use data, and how to transform "a bunch of data" into "big data."

1.3 Big data value

McKinsey & Company observed how big data created values after in-depth research on US healthcare, the EU public sector administration, US retail, global manufacturing, and global personal location data. Through research on the five core industries that represent the global economy, the McKinsey report pointed out that big data may give full play to the economic function, improve the productivity and competitiveness of enterprises and public sectors, and create huge benefits for consumers. In [10], McKinsey summarized the values that big data could create: if big data could be creatively and effectively utilized to improve efficiency and quality, the potential value of the US medical industry gained through data may surpass USD 300 billion, thus reducing the expenditure for US healthcare by over 8 %; retailers that fully utilize big data may improve their profit by more than 60 %; big data may also be utilized to improve the efficiency of government operations, such that the developed economies in Europe could save over EUR 100 billion (which excludes the effect of reduced frauds, errors, and tax difference).


Fig. 2 The 4Vs feature of big data

The McKinsey report is regarded as prospective and predictive, while the following facts may validate the values of big data. During the 2009 flu pandemic, Google obtained timely information by analyzing big data, which even provided more valuable information than that provided by disease prevention centers. Nearly all countries required hospitals to inform agencies such as disease prevention centers of new cases of the new type of influenza. However, patients usually did not see doctors immediately when they got infected. It also took some time to send information from hospitals to disease prevention centers, and for disease prevention centers to analyze and summarize such information. Therefore, when the public became aware of the pandemic of the new type of influenza, the disease may have already spread for one to two weeks, with a hysteretic nature. Google found that, during the spreading of influenza, entries frequently sought at its search engines would be different from those at ordinary times, and the use frequencies of the entries were correlated to the influenza spreading in both time and location. Google found 45 search entry groups that were closely relevant to the outbreak of influenza and incorporated them in specific mathematical models to forecast the spreading of influenza, and even to predict places where influenza spread from. The related research results have been published in Nature [18].

In 2008, Microsoft purchased Farecast, a sci-tech venture company in the US. Farecast has an airline ticket forecast system that predicts the trends and rising/dropping ranges of airline ticket prices. The system has been incorporated into the Bing search engine of Microsoft. By 2012, the system had saved nearly USD 50 per ticket per passenger, with a forecast accuracy as high as 75 %.

At present, data has become an important production factor that could be comparable to material assets and human capital. As multimedia, social media, and IoT are developing, enterprises will collect more information, leading to an exponential growth of data volume. Big data will have a huge and increasing potential in creating values for businesses and consumers.

1.4 The development of big data

In the late 1970s, the concept of the "database machine" emerged, which is a technology specially used for storing and analyzing data. With the increase of data volume, the storage and processing capacity of a single mainframe computer system became inadequate. In the 1980s, people proposed "share nothing," a parallel database system, to meet the demand of the increasing data volume [19]. The share-nothing system architecture is based on the use of clusters, and every machine has its own processor, storage, and disk. The Teradata system was the first successful commercial parallel database system. Such databases became very popular later. On June 2, 1986, a milestone event occurred, when Teradata delivered the first parallel database system with a storage capacity of 1 TB to Kmart, to help the large-scale retail company in North America expand its data warehouse [20]. In the late 1990s, the advantages of parallel databases were widely recognized in the database field.

However, many challenges on big data arose. With the development of Internet services, indexes and queried contents were rapidly growing. Therefore, search engine companies had to face the challenges of handling such big data. Google created the GFS [21] and MapReduce [22] programming models to cope with the challenges brought about by data management and analysis at the Internet scale. In addition, contents generated by users, sensors, and other ubiquitous data sources also fueled the overwhelming data flows, which required a fundamental change of the computing architecture and large-scale data processing mechanism. In January 2007, Jim Gray, a pioneer of database software, called such transformation "The Fourth Paradigm" [23]. He also thought the only way to cope with such a paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred: EMC/IDC published a research report titled Extracting Values from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest on big data in both industry and academia.

Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, etc., have started their big data projects. Taking IBM as an example, since 2005, IBM has invested USD 16 billion on 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of "data processing" in big data. In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. In the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis into 48 emerging technologies that deserve most attention.

Many national governments, such as the US, also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the "Big Data Research and Development Plan," which was a second major scientific and technological development initiative after the "Information Highway" initiative in 1993. In July 2012, the "Vigorous ICT Japan" project issued by Japan's Ministry of Internal Affairs and Communications indicated that the big data development should be a national strategy and application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

1.5 Challenges of big data

The sharply increasing data deluge in the big data era brings about huge challenges on data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs only apply to structured data, rather than semi-structured or unstructured data. In addition, RDBMSs are increasingly utilizing more and more expensive hardware. It is apparent that the traditional RDBMSs could not handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the requirements on infrastructure for big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For solutions of permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Such programming frameworks have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy the big data analysis systems.

Some literature [26–28] discusses obstacles in the development of big data applications. The key challenges are listed as follows:

– Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

– Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective in reducing the indirect cost of the entire system, on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant, which may be filtered and compressed at orders of magnitude (a minimal numerical sketch is given after this list).

– Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with a lot of pressing challenges, one of which is that the current storage system could not support such massive data. Generally speaking, values hidden in big data depend on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

– Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed with a lack of scalability and expandability, which could not meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and started to become mainstream in big data analysis. Even so, there are still some problems of non-relational databases in their performance and particular applications. We shall find a compromising solution between RDBMSs and non-relational databases. For example, some enterprises have utilized a mixed database architecture that integrates the advantages of both types of databases (e.g., Facebook and Taobao). More research is needed on the in-memory database and sample data based on approximate analysis.

– Data confidentiality: most big data service providers or owners at present could not effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, the transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data, to ensure its safety.

– Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economy and environment perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data, while the expandability and accessibility are ensured.

– Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

– Cooperation: analysis of big data is interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.
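The redundancy reduction point above can be made concrete with a toy sketch. The Python snippet below, using entirely synthetic readings chosen only for illustration, delta-encodes a slowly changing sensor stream and then applies a general-purpose lossless compressor, showing how highly redundant sensor data can shrink dramatically; a real system would pick encodings based on the sensor modality and accuracy requirements.

```python
# Toy illustration of redundancy reduction on a highly redundant sensor stream:
# delta-encode a slowly changing signal, then apply lossless compression.
# The readings below are synthetic and serve only to illustrate the idea.
import struct
import zlib

readings = [20.0 + 0.01 * (i % 5) for i in range(10_000)]   # near-constant signal

raw = struct.pack(f"{len(readings)}d", *readings)            # 8 bytes per sample
deltas = [readings[0]] + [b - a for a, b in zip(readings, readings[1:])]
delta_bytes = struct.pack(f"{len(deltas)}d", *deltas)

print("raw size (bytes):      ", len(raw))
print("compressed raw:        ", len(zlib.compress(raw)))
print("compressed delta-coded:", len(zlib.compress(delta_bytes)))
```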

2 Related technologies

In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data centers, and Hadoop.

2.1 Relationship between cloud computing and big data

Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

Even though there are many overlapping technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of a database and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands and cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, both of which supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous amount of networking sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured features, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030, the quantity of sensors will reach one trillion and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data since the success of IoT is hinged upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data has already lagged behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT


2.3 Data center

In the big data paradigm, the data center not only is a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern "data" rather than "center." A data center has masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but, at present, it is the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to data centers. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers is increasingly expanding, it is also an important issue how to reduce the operational cost for the development of data centers.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only concern itself with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop in 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering, etc. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.
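To make the clickstream-analysis use case above more concrete, the following sketch simulates the map and reduce phases of a click-counting job locally in Python; on a real Hadoop cluster, the same two functions would typically be supplied to Hadoop Streaming as separate mapper and reducer scripts reading standard input, with the framework handling the shuffle over HDFS-resident data. The sample log lines and their field layout are hypothetical.

```python
# Local simulation of a MapReduce clickstream-counting job.
from itertools import groupby
from operator import itemgetter

def mapper(log_line):
    """Map: emit (url, 1) for one click record; assumes the first field is a URL."""
    fields = log_line.split()
    if fields:
        yield fields[0], 1

def reducer(url, counts):
    """Reduce: sum all partial counts for one URL."""
    return url, sum(counts)

# Hypothetical click records; a real job reads terabytes of logs from HDFS.
log_lines = [
    "/index.html 10.0.0.1",
    "/news.html 10.0.0.2",
    "/index.html 10.0.0.3",
]

mapped = [kv for line in log_lines for kv in mapper(line)]
mapped.sort(key=itemgetter(0))                    # stands in for the shuffle/sort step
for url, group in groupby(mapped, key=itemgetter(0)):
    print(reducer(url, (count for _, count in group)))
```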

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine into data-driven science. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ, to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the second application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that the loose coupling will be increasingly applied to research on electron clouds, and the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data centers, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Take Internet data as an example: huge amounts of data in terms of searching entries, Internet forum posts, chatting records, and microblog messages are generated. Those data are closely related to people's daily life, and have similar features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improve the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close transmission may rely on sensor networks, and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are distributedly deployed, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data, but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scale.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, the frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but also the leading roles can be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R & D, and grain production (e.g., transgenic crops).

The completion of HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015, this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R & D also rises quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet." IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167 TB to 665 TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demand on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the US National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank has more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, has recorded 25 TB data from 1998 to 2008. As the resolution of the telescope is improved, by 2004, the data volume generated per night surpassed 20 TB. The last application is related to high-energy physics. In the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2 PB/s and stores about 10 TB of processed data per year.

In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information, to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: Sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., the video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habit monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded web page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.
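The URL-queue procedure just described can be sketched in a few lines. The crawler below (Python standard library only) is a minimal illustration: the seed URL, page limit, and omitted storage step are placeholders, and a production crawler would additionally honor robots.txt, add politeness delays, and persist pages to an index.

```python
# Minimal breadth-first web crawler following the URL-queue procedure above.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    queue, seen = deque([seed_url]), {seed_url}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                              # skip unreachable pages
        # ... a real crawler would store `page` here for indexing ...
        extractor = LinkExtractor()
        extractor.feed(page)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute not in seen:              # avoid revisiting URLs
                seen.add(absolute)
                queue.append(absolute)

crawl("http://example.com/")                      # hypothetical seed URL
```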

The current network data acquisition technologies mainly include the traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used. (A conceptual sketch of data-link-layer capture is given after this list.)


– Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at the external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
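As referenced in the Libpcap item above, the following sketch illustrates what capture at the data link layer amounts to. It is not Libpcap itself (which is a C library); it uses a Linux raw packet socket from the Python standard library, requires root privileges, and only prints Ethernet header fields, so it should be read as a conceptual analogue rather than the library's API.

```python
# Conceptual analogue of data-link-layer packet capture (Linux only, needs root).
import socket
import struct

ETH_P_ALL = 0x0003                                 # capture every protocol
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(ETH_P_ALL))

for _ in range(10):                                # capture ten frames, then stop
    frame, _addr = sock.recvfrom(65535)
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])   # Ethernet header
    print(f"src={src.hex(':')} dst={dst.hex(':')} "
          f"ethertype=0x{ethertype:04x} len={len(frame)}")
```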

In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

– Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52] (a small numerical sketch of this idea is given after this list). Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected by its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TOR), and such top-of-rack switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one more layer on top of the two-layer structure; this layer is constituted by 10 Gbps or 100 Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers; such optical links provide connections for the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]: some plans add optical paths to upgrade the existing networks, while other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure; it is the most complex procedure among the three, and includes operations such as transformation, copying, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; on the contrary, it includes information or metadata related to the actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or of search programs and applications; compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then to modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance to keeping data consistent, and is widely applied in many fields, such as banking, insurance, the retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected and may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and by regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, original RFID data features low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors in input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects; for example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion (deduplication) is a special data compression technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As repeated data deletion proceeds, if a new data block has an identifier that is identical to one already in the identification list, the new data block will be deemed redundant and will be replaced by a reference to the corresponding stored data block (a minimal sketch is given below). Repeated data deletion can greatly reduce the storage requirement, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores these feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of the variety of datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
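As an illustration of the repeated-data-deletion idea described above, the following is a minimal sketch (not from the original systems cited in this survey) of hash-based block deduplication in Python; the block size and the choice of SHA-256 as the identifier function are arbitrary assumptions for the example.

    import hashlib

    BLOCK_SIZE = 4096  # arbitrary fixed block size for this sketch

    def deduplicate(data: bytes):
        """Split data into fixed-size blocks and store each unique block once.

        Returns (block_store, block_ids): block_store maps an identifier
        (SHA-256 digest) to the block content, and block_ids is the ordered
        identification list needed to reconstruct the original data.
        """
        block_store = {}   # identifier -> block content (unique copies only)
        block_ids = []     # identification list describing the original stream
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            identifier = hashlib.sha256(block).hexdigest()
            if identifier not in block_store:      # new, non-redundant block
                block_store[identifier] = block
            block_ids.append(identifier)           # redundant blocks reuse the id
        return block_store, block_ids

    def reconstruct(block_store, block_ids) -> bytes:
        """Rebuild the original byte stream from stored blocks and identifiers."""
        return b"".join(block_store[i] for i in block_ids)

    if __name__ == "__main__":
        payload = b"abc" * 10000          # highly redundant sample data
        store, ids = deduplicate(payload)
        assert reconstruct(store, ids) == payload
        print(f"{len(ids)} blocks referenced, {len(store)} unique blocks stored")

In this toy run, only a handful of unique blocks are stored even though the stream references several times as many, which is the storage saving that deduplication targets.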

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices were used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue large storage capacity to remain competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually an auxiliary storage device of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the perspective of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network sub-systems, which provide connection among one or more disk arrays and servers; (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers, to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system were not seriously affected in satisfying customers' requests in terms of reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable if the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of the data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance; therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. They differ from CP systems in that they additionally ensure availability, but only guarantee eventual consistency rather than the strong consistency of the previous two system types. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Service (SNS) systems, there are many concurrent visits to the data, but a certain amount of data error is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at the upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source code of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation; therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges in categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support of large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: Key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform, which can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data replication, and object versioning mechanisms. The Dynamo partition plan relies on consistent hashing [86], whose main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes, when dividing the load among multiple main storage machines (a minimal sketch of consistent hashing is given after this list of databases). Dynamo replicates data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. Notably, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

Key-value databases emerged only a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.

– Column-oriented databases: Column-oriented databases store and process data by columns rather than by rows. Both columns and rows are segmented across multiple nodes to achieve expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional, sorted mapping with sparse, distributed, and persistent storage. The indexes of the mapping are the row key, column key, and timestamp, and every value in the mapping is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short row of data can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families; these column families are the basic units for access control. The timestamps are 64-bit integers used to distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sorted in descending order of timestamps, so the latest edition will always be read first.

The BigTable API features the creation and deletion of Tablets and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable involves three main components: the Master server, Tablet servers, and the client library. BigTable allows only one Master server to be deployed, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, such as deleted or disabled files used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of loaded Tablets; when Tablets become too big, they will be segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally; it provides a mapping between persistent, sorted, and immutable keys and values, both as arbitrary byte strings. BigTable utilizes Chubby for the following server tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; 6) storing the access control table.

– Cassandra: Cassandra is a distributed storage system to manage the huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly flushes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partitioning and distribution are operated transparently and have space for client hash or fixed keys.

Hypertable was developed similarly to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. Hypertable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. The data representation, processing, and partition mechanisms are similar to those in BigTable. Hypertable has its own query language, called Hypertable Query Language (HQL), and allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency for concurrent control of multiple editions, while HBase and Hypertable focus on strong consistency through locks or log records.

– Document databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The replication operation in MongoDB is executed with log files on the main nodes that record all the high-level operations conducted in the database. During replication, the slave nodes query all the writing operations since their last synchronization with the master and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, automatically balancing load and providing failover.


– SimpleDB: SimpleDB is a distributed database offered as a web service by Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be scaled as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, the identifier is updated. CouchDB utilizes optimistic replication to achieve scalability without a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on its replication mechanism. CouchDB supports MVCC with historical hash records.
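To make the consistent-hashing partition scheme mentioned for Dynamo more concrete, the following is a minimal sketch (not from the survey or from Dynamo itself) in Python; the hash function, number of virtual nodes, and server names are illustrative assumptions only.

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Minimal consistent-hashing ring: a key maps to the first node clockwise.

        Adding or removing a node only remaps the keys between that node and
        its predecessor on the ring, leaving the other nodes untouched.
        """

        def __init__(self, nodes=(), replicas=100):
            self.replicas = replicas          # virtual nodes per physical node
            self._ring = []                   # sorted list of hash positions
            self._node_at = {}                # hash position -> node name
            for node in nodes:
                self.add_node(node)

        @staticmethod
        def _hash(value: str) -> int:
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def add_node(self, node: str):
            for i in range(self.replicas):
                pos = self._hash(f"{node}#{i}")
                bisect.insort(self._ring, pos)
                self._node_at[pos] = node

        def remove_node(self, node: str):
            for i in range(self.replicas):
                pos = self._hash(f"{node}#{i}")
                self._ring.remove(pos)
                del self._node_at[pos]

        def get_node(self, key: str) -> str:
            pos = self._hash(key)
            idx = bisect.bisect(self._ring, pos) % len(self._ring)
            return self._node_at[self._ring[idx]]

    if __name__ == "__main__":
        ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
        print(ring.get_node("user:42"))   # the key is owned by one server
        ring.add_node("server-d")         # only neighbouring keys are remapped
        print(ring.get_node("user:42"))

The virtual nodes (replicas) smooth the load distribution; a production system such as Dynamo adds replication to the N successors on the ring, which is omitted here for brevity.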

Big data is generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL databases and reduced the performance gap relative to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication: the user only needs to program the two functions to develop a parallel application (a minimal word-count sketch is given after this list of programming models). The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed on a cluster or a workstation through the network. The job manager consists of two parts: 1) application code, which is used to build a job communication graph; and 2) program library code, which is used to arrange available resources. All kinds of data are directly transmitted between vertexes; therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output data sets, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets with a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A against all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmissions, which enables the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after job completion in the batch processing system, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge, associated with a source vertex, is constituted by a user-defined value and the identifier of the target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is finished. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may be deactivated by suspension; when all vertexes are in an inactive status and there are no messages to transmit, the entire program execution is completed.

The Pregel program output is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
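As a concrete illustration of the two-function MapReduce model described above, the following is a minimal, single-machine word-count sketch in Python (not the actual MapReduce implementation); the map_fn and reduce_fn names are hypothetical stand-ins for the user-written Map and Reduce functions, while the grouping of intermediate key-value pairs is simulated in a few lines.

    from collections import defaultdict

    # User-written Map: input record -> list of intermediate (key, value) pairs.
    def map_fn(document: str):
        return [(word, 1) for word in document.split()]

    # User-written Reduce: a key and all its intermediate values -> final value.
    def reduce_fn(word: str, counts):
        return sum(counts)

    def mapreduce(documents):
        """Simulate the framework: run Map, group by key, then run Reduce."""
        intermediate = defaultdict(list)
        for doc in documents:                       # map phase
            for key, value in map_fn(doc):
                intermediate[key].append(value)     # shuffle/group by key
        return {key: reduce_fn(key, values)         # reduce phase
                for key, values in intermediate.items()}

    if __name__ == "__main__":
        docs = ["big data big value", "data analysis of big data"]
        print(mapreduce(docs))
        # {'big': 3, 'data': 3, 'value': 1, 'analysis': 1, 'of': 1}

In a real MapReduce deployment, the framework would additionally handle data scheduling, fault tolerance, and inter-node communication, which is exactly the burden the model removes from the programmer.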

Inspired by the above programming models, other research has also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and for big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data; therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

– Cluster analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

– Factor analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into a factor, and then using the few factors to reveal most of the information in the original data.



– Correlation analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, i.e., undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness, and can turn complex and undetermined correlations among variables into simple and regular ones (a minimal least-squares sketch is given at the end of this list).

– A/B testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description of and an inference on big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research.
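To make the regression idea above concrete, here is a minimal sketch (not from the survey) of simple least-squares linear regression in pure Python; the sample observations are invented for illustration.

    def linear_regression(xs, ys):
        """Fit y = a + b*x by ordinary least squares and return (a, b)."""
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        # b = covariance(x, y) / variance(x); a = mean_y - b * mean_x
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        b = cov_xy / var_x
        a = mean_y - b * mean_x
        return a, b

    if __name__ == "__main__":
        # Hypothetical observations: y roughly follows 2 + 3*x plus noise.
        xs = [1, 2, 3, 4, 5]
        ys = [5.1, 7.9, 11.2, 13.8, 17.1]
        a, b = linear_regression(xs, ys)
        print(f"intercept={a:.2f}, slope={b:.2f}")   # about 2.05 and 2.99

The same least-squares principle generalizes to multiple explanatory variables, which is how regression "turns complex and undetermined correlations into simple and regular ones."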

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: a Bloom Filter consists of a series of hash functions. The principle of the Bloom Filter is to store hash values of the data, rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has advantages such as high space efficiency and high query speed, but also has disadvantages in misrecognition (false positives) and deletion (a minimal sketch is given after this list).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically as data is updated.

– Trie: also called a trie tree, a variant of a hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize the common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency (a short word-frequency sketch appears at the end of this subsection).

– Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see the comparison in Table 1).
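The following is a minimal Bloom filter sketch in Python (not from the survey); the bit-array size, the number of hash functions, and the use of salted SHA-1 digests are arbitrary assumptions for illustration.

    import hashlib

    class BloomFilter:
        """Bit-array Bloom filter: no false negatives, but possible false positives."""

        def __init__(self, num_bits=1024, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item: str):
            # Derive several hash positions by salting one hash function.
            for salt in range(self.num_hashes):
                digest = hashlib.sha1(f"{salt}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item: str) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    if __name__ == "__main__":
        bf = BloomFilter()
        bf.add("alice")
        bf.add("bob")
        print(bf.might_contain("alice"))  # True
        print(bf.might_contain("carol"))  # False (with high probability)

The lossy nature of the structure is visible here: membership answers of "yes" may occasionally be wrong (misrecognition), and individual items cannot be deleted without rebuilding the filter.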

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages have been developed on top of these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
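Returning to the trie structure listed above, the following is a minimal word-frequency trie sketch in Python (not from the survey); storing counts at terminal nodes is one possible design choice among several.

    class TrieNode:
        __slots__ = ("children", "count")

        def __init__(self):
            self.children = {}   # character -> TrieNode
            self.count = 0       # number of times a word ends at this node

    class Trie:
        """Prefix tree for word-frequency statistics: shared prefixes are stored once."""

        def __init__(self):
            self.root = TrieNode()

        def insert(self, word: str):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1

        def frequency(self, word: str) -> int:
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return 0
            return node.count

    if __name__ == "__main__":
        trie = Trie()
        for w in "big data big value data data".split():
            trie.insert(w)
        print(trie.frequency("data"))   # 3
        print(trie.frequency("cloud"))  # 0

Because words sharing a prefix share a path in the tree, lookups cost time proportional to the word length rather than to the number of stored words, which is the query-efficiency gain the trie targets.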

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment: MPI – computing nodes and data storage arranged separately (data should be moved to the computing nodes); MapReduce – computing and data storage arranged at the same node (computing should be close to data); Dryad – computing and data storage arranged at the same node (computing should be close to data)

Resource management/scheduling: MPI – none; MapReduce – Workqueue (Google), HOD (Yahoo); Dryad – not clear

Low-level programming: MPI – MPI API; MapReduce – MapReduce API; Dryad – Dryad API

High-level programming: MPI – none; MapReduce – Pig, Hive, Jaql, ...; Dryad – Scope, DryadLINQ

Data storage: MPI – the local file system, NFS, ...; MapReduce – GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ...; Dryad – NTFS, Cosmos DFS

Task partitioning: MPI – users manually partition the tasks; MapReduce – automatic; Dryad – automatic

Communication: MPI – messaging, remote memory access; MapReduce – files (local FS, DFS); Dryad – files, TCP pipes, shared-memory FIFOs

Fault tolerance: MPI – checkpoint; MapReduce – task re-execution; Dryad – task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since the data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis is generally conducted by importing logs into a special platform through data acquisition tools. In the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. Current mainstream BI products provide data analysis plans that support data beyond the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the kinds of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used pieces of software, according to a survey of "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" of 798 professionals conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and plotting. Compared to S, R is more popular since it is open source. R ranked top 1 in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year," R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked top 1). The data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL) data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. The functions of RapidMiner are implemented by connecting processes composed of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and to obtain analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly add new nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, along with their data and analysis characteristics, are discussed as follows.

– Evolution of commercial applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity to support location-sensing, people-oriented, and context-aware operations.

– Evolution of network applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and to building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform of interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, a wealth of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.


– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to provide comprehensive coverage, we will focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning, based on exact mathematical models and powerful algorithms, has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].
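As a minimal illustration of the statistical flavor of such methods, the following Python sketch (not taken from any of the cited works; the data and threshold are made up) flags observations that deviate strongly from the mean, a basic form of anomaly detection:

# Minimal sketch (illustrative only): flagging anomalous readings with a
# simple z-score rule, a basic form of statistical anomaly detection.

def zscore_anomalies(values, threshold=3.0):
    """Return indices of values lying more than `threshold` standard
    deviations away from the mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 55.0, 9.9, 10.2]   # one obvious outlier
print(zscore_anomalies(readings, threshold=2.0))       # -> [4]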

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
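To make the notion of a text expression concrete, the following Python sketch (toy documents; illustrative only, not a method from [122]) scores terms with TF-IDF, a basic text-representation step that many of the text mining techniques above build on:

# Minimal sketch (toy documents): TF-IDF term weighting as a simple
# text-expression step for text mining.
import math
from collections import Counter

docs = [
    "big data analysis creates value",
    "text mining extracts knowledge from unstructured text",
    "data mining and machine learning support text analysis",
]

tokenized = [d.split() for d in docs]
df = Counter(t for doc in tokenized for t in set(doc))   # document frequency
n_docs = len(docs)

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    return {t: (tf[t] / len(doc_tokens)) * math.log(n_docs / df[t])
            for t in tf}

# top-scoring terms of the second document
scores = tfidf(tokenized[1])
print(sorted(scores, key=scores.get, reverse=True)[:3])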

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawlers are another successful case of utilizing such models [127].
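As an illustration of such link-structure models, the sketch below (a toy graph and a simplified re-implementation, not the original PageRank or CLEVER code) computes PageRank scores by power iteration:

# Minimal sketch (toy hyperlink graph): power-iteration PageRank.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:                       # dangling page: spread evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / len(pages)
            else:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_web))   # page C accumulates the highest rank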

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data is featuring increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
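The following sketch (hypothetical log lines in the Common Log Format) shows the kind of elementary processing Web usage mining starts from, i.e., parsing server access logs and counting page visits:

# Minimal sketch (hypothetical log lines): parsing Common Log Format
# entries and counting successful page visits, a basic Web usage step.
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

sample_logs = [
    '10.0.0.1 - - [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2013:13:56:01 -0700] "GET /products HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Oct/2013:13:57:12 -0700] "GET /products HTTP/1.1" 200 512',
]

hits = Counter()
for line in sample_logs:
    m = LOG_PATTERN.match(line)
    if m:
        host, _, method, url, status, _ = m.groups()
        if method == "GET" and status == "200":
            hits[url] += 1

print(hits.most_common())   # e.g. [('/products', 2), ('/index.html', 1)]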


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; useful knowledge is extracted and the semantics are understood by analyzing such data. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and adopt other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories, so as to generate video indexes. Upon receiving a query, the system will use a similarity measurement method to look up candidate videos. The retrieval result is then optimized based on related feedback.
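A minimal sketch of the similarity-measurement step is given below (the feature vectors and video identifiers are invented for illustration); it ranks indexed videos by cosine similarity to a query vector:

# Minimal sketch (hypothetical feature vectors, not a real indexing system):
# cosine-similarity lookup of candidate videos by their extracted features.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical index: video id -> feature vector from key frames/objects/text
index = {
    "video_a": [0.9, 0.1, 0.3],
    "video_b": [0.2, 0.8, 0.5],
    "video_c": [0.85, 0.05, 0.4],
}

query = [0.88, 0.08, 0.35]
candidates = sorted(index, key=lambda v: cosine(index[v], query), reverse=True)
print(candidates[:2])   # the two most similar videos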

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of their interests, and recommend other contents with similar features to the users. These methods largely rely on content similarity measurement, but most of them are troubled by analysis limitations and excessive specifications. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
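The following Python sketch illustrates the collaborative-filtering idea with a toy rating matrix (the user and item names are made up, and this is not the method of [132] or [133]): it recommends items liked by users whose ratings are similar to the target user's:

# Minimal sketch (toy ratings): user-based collaborative filtering.
import math

ratings = {   # user -> {item: rating}
    "u1": {"clip1": 5, "clip2": 3, "clip3": 4},
    "u2": {"clip1": 4, "clip2": 2, "clip4": 5},
    "u3": {"clip2": 5, "clip3": 1},
}

def similarity(a, b):
    """Cosine similarity over the items two users rated in common."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def recommend(user, k=2):
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        for item, r in other_ratings.items():
            if item not in ratings[user]:            # only unseen items
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u1"))   # -> ['clip4']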

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text description related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].



The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNSs are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict the future link [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in an SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph, in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
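As a concrete illustration of link-prediction scoring (a toy graph, not the methods of [139-141]), the sketch below ranks candidate links by two simple neighborhood measures, common neighbors and the Jaccard coefficient:

# Minimal sketch (toy undirected SNS graph): simple link-prediction scores.

graph = {   # adjacency sets
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave":  {"bob"},
}

def common_neighbors(u, v):
    return len(graph[u] & graph[v])

def jaccard(u, v):
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

# score non-adjacent pairs and rank by likelihood of a future link
pairs = [("alice", "dave"), ("carol", "dave")]
for u, v in pairs:
    print(u, v, common_neighbors(u, v), round(jaccard(u, v), 2))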

Many methods for community detection have been proposed and studied, most of which are topology-based, relying on target functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in an SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and so does Twitter with trivial Tweets. Third, SNSs are dynamic networks, which vary frequently and quickly and are constantly updated. The existing research on social media analysis is still in its infancy. Considering that SNSs contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and such communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for the real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection.


In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for the real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information to unlock the safety system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chain management, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even the hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and to optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills, due to timely identifying and fixing water pipes that were running and leaking this year.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network. Big data from online SNS mainly comes from instant messages, online social micro blogs, and shared space, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions, including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.



– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing the social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, then get married when they are about 30 years old; finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
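A minimal sketch of the first kind of analysis, detecting abnormal growth of a topic, is given below (the daily counts are synthetic and the window and threshold are arbitrary assumptions):

# Minimal sketch (synthetic counts): flagging abnormal growth of a topic by
# comparing each day's count with a trailing moving average.

def spikes(counts, window=7, factor=2.0):
    """Return indices where the count exceeds `factor` times the
    average of the previous `window` observations."""
    alerts = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        if baseline > 0 and counts[i] > factor * baseline:
            alerts.append(i)
    return alerts

daily_mentions = [120, 130, 125, 118, 140, 135, 128, 560, 600, 150]
print(spikes(daily_mentions))   # -> [7, 8]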

Fig. 5 Enabling technologies for online social network-oriented big data



Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment, in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies from Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation



6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to conduct coordination with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operational framework of Spatial Crowdsourcing is as follows. A user may request services and resources related to a specified location. Then, the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecasted that Spatial Crowdsourcing will become more prevailing than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
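The matching step of this workflow can be illustrated with a small sketch (the coordinates and worker identifiers are hypothetical): the requester's task is assigned to the nearest willing worker by great-circle distance:

# Minimal sketch (hypothetical coordinates): assigning a spatial task to the
# nearest willing mobile user.
import math

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

workers = {          # worker id -> current (lat, lon)
    "w1": (30.52, 114.31),
    "w2": (30.60, 114.40),
    "w3": (30.45, 114.20),
}

task_location = (30.50, 114.30)    # where the requester needs a photo/video
nearest = min(workers, key=lambda w: haversine_km(workers[w], task_location))
print(nearest, round(haversine_km(workers[nearest], task_location), 2), "km")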

6.3.6 Smart grid

Smart Grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for the optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and the building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map (a minimal per-block aggregation sketch is given after this list).

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has had several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced, and, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to the peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.



– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
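The per-block aggregation mentioned in the grid planning item above can be illustrated with a minimal sketch (the meter readings and the overload threshold are synthetic assumptions):

# Minimal sketch (synthetic smart-meter records): aggregating 15-minute
# readings per city block and flagging blocks above a planning threshold.
from collections import defaultdict

# (block id, kWh consumed in a 15-minute interval) -- hypothetical readings
readings = [
    ("block-1", 120.0), ("block-2", 310.5), ("block-1", 98.4),
    ("block-3", 45.2),  ("block-2", 295.1), ("block-3", 50.0),
]

OVERLOAD_KWH = 500.0   # assumed planning threshold per block

totals = defaultdict(float)
for block, kwh in readings:
    totals[block] += kwh

for block, total in sorted(totals.items()):
    flag = "OVERLOADED" if total > OVERLOAD_KWH else "ok"
    print(f"{block}: {total:.1f} kWh {flag}")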

7 Conclusion, open issues, and outlook

In this paper, we reviewed the background and state-of-the-art of big data. First, we introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there are many important problems remaining to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions for big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more values.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor for improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust, i.e., data that is wrong or discarded during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown to be not applicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of this. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, acquired data from the public pages of Facebook users who had failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.



– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have a social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by the globally-distributed database Spanner of Google and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are friendly displayed may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.



– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data, rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things, rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain, rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and progress in data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12
2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf
3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper
4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/big-data-0
5. Lohr S (2012) The age of big data. New York Times, pp 11
6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data
7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data
8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html
9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data
10. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute
11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt
12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note, 6 February
13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media
14. Meijer E (2011) The world according to LINQ. Communications of the ACM 54(10):45–51
15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp
16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media
17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/
18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014
19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present and future. UCI ISG lecture series on scalable data management
21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43
22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery
24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81
25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27
26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033
27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98
28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States
29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM
30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy
31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Trans Parallel Distrib Syst 23(10):1831–1843
32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Pract Experience 23(17):2338–2354
33. Gantz J, Reinsel D (2010) The digital universe decade – are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16
34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33
35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48, 2008
36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68
37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180
38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729
39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems. ACM, pp 103–116
40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 international conference on (IPSN'08). IEEE, pp 332–343
41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems. ACM, pp 309–322
42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007, 6th international symposium on (IPSN 2007). IEEE, pp 254–263
43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks. IEEE Computer Society, pp 277–288
44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63
45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687
46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135
47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160
48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84
49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142
50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on (ECOC'09). IEEE, pp 1–4
51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synth Lect Comput Archit 4(1):1–108
52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204
53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577
54. Cisco data center interconnect design and deployment guide (2010)
55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM Computer Communication Review, vol 39. ACM, pp 51–62
56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74
57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350
58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62
59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338
61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24
62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8
63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383
64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395
65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454
66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246
67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101
68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209
69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113
70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62
71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871
72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008, IEEE 24th international conference on (ICDE 2008). IEEE, pp 1480–1482
73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82
74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Trans Multimed 14(3):669–682
75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278
76. Kamath U, Compton J, Dogan RI, De Jong K, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Trans Comput Biol Bioinforma (TCBB) 9(5):1387–1398
77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Trans Comput Biol Bioinforma 8(2):428–440
78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1021–1032
79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1
80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7
81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59
82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10
83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276
84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8
85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220
86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663
87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4
88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation. USENIX Association, pp 335–350
89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5
90. George L (2011) HBase: the definitive guide. O'Reilly Media, Inc
91. Judd D (2008) hypertable-0.9.0.4-alpha
92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc
93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)
94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc
95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc
96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986
97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322
98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298
99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proc VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629
101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72
102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14
103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on (IPDPS 2008). IEEE, pp 1–11
104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146
105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296
106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818
107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2
108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7
109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9
110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York
111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html
113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer
114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT
115. Beyond the PC. Special report on personal technology (2011)
116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034
117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77
118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286
119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26
120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57
121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inform Syst (TMIS) 3(2):7
122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press
123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Trans Neural Netw 13(5):1163–1177
124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11
125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117
126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the World-Wide Web. In: VLDB, vol 95, pp 54–65
127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640
128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2
129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25
130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19
131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819
132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939
133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311
134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91
135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478
136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569
137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company
138. Aggarwal CC (2011) An introduction to social network data analytics. Springer
139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054
140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10
142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web. ACM, pp 631–640
143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25
144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321
145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158
146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144
147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016
148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816
149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666
150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360
151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)
152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51
153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2
154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478
155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454
156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



generates data of tens of Terabytes (TB) for online trading per day. Figure 1 illustrates the boom of the global data volume. While the amount of large datasets is drastically rising, it also brings about many challenging problems demanding prompt solutions:

– The latest advances of information technology (IT) make it easier to generate data. For example, on average, 72 hours of video are uploaded to YouTube every minute [11]. Therefore, we are confronted with the main challenge of collecting and integrating massive data from widely distributed data sources.

– The rapid growth of cloud computing and the Internet of Things (IoT) further promotes the sharp growth of data. Cloud computing provides safeguarding, access sites, and channels for data assets. In the paradigm of IoT, sensors all over the world are collecting and transmitting data to be stored and processed in the cloud. Such data, in both quantity and mutual relations, will far surpass the capacities of the IT architectures and infrastructure of existing enterprises, and its real-time requirement will also greatly stress the available computing capacity. The increasingly growing data cause the problem of how to store and manage such huge heterogeneous datasets with moderate requirements on hardware and software infrastructure.

– In consideration of the heterogeneity, scalability, real-time nature, complexity, and privacy of big data, we shall effectively "mine" the datasets at different levels during the analysis, modeling, visualization, and forecasting, so as to reveal their intrinsic properties and improve decision making.

1.2 Definition and features of big data

Big data is an abstract concept. Apart from masses of data, it also has some other features, which determine the difference between itself and "massive data" or "very big data".

Fig. 1 The continuously increasing big data


At present, although the importance of big data has been generally recognized, people still have different opinions on its definition. In general, big data refers to the datasets that could not be perceived, acquired, managed, and processed by traditional IT and software/hardware tools within a tolerable time. Because of different concerns, scientific and technological enterprises, research scholars, data analysts, and technical practitioners have different definitions of big data. The following definitions may help us gain a better understanding of the profound social, economic, and technological connotations of big data.

In 2010, Apache Hadoop defined big data as "datasets which could not be captured, managed, and processed by general computers within an acceptable scope." On the basis of this definition, in May 2011, McKinsey & Company, a global consulting agency, announced Big Data as the next frontier for innovation, competition, and productivity. Big data shall mean such datasets which could not be acquired, stored, and managed by classic database software. This definition includes two connotations. First, the dataset volumes that conform to the standard of big data are changing, and may grow over time or with technological advances. Second, the dataset volumes that conform to the standard of big data in different applications differ from each other. At present, big data generally ranges from several TB to several PB [10]. From the definition by McKinsey & Company, it can be seen that the volume of a dataset is not the only criterion for big data. The increasingly growing data scale and its management that could not be handled by traditional database technologies are the next two key features.

As a matter of fact, big data was defined as early as 2001. Doug Laney, an analyst of META (presently Gartner), defined challenges and opportunities brought about by increased data with a 3Vs model, i.e., the increase of Volume, Velocity, and Variety, in a research report [12]. Although such a model was not originally used to define big data, Gartner and many other enterprises, including IBM [13] and some research departments of Microsoft [14], still used the "3Vs" model to describe big data within the following ten years [15]. In the "3Vs" model, Volume means that, with the generation and collection of masses of data, data scale becomes increasingly big; Velocity means the timeliness of big data, i.e., data collection and analysis must be conducted rapidly and in a timely manner so as to maximally utilize the commercial value of big data; Variety indicates the various types of data, which include semi-structured and unstructured data such as audio, video, webpage, and text, as well as traditional structured data.

However, others have different opinions, including IDC, one of the most influential leaders in big data and its research fields. In 2011, an IDC report defined big data as follows: "big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis" [1]. With this definition, characteristics of big data may be summarized as four Vs, i.e., Volume (great volume), Variety (various modalities), Velocity (rapid generation), and Value (huge value but very low density), as shown in Fig. 2. Such a 4Vs definition was widely recognized since it highlights the meaning and necessity of big data, i.e., exploring the huge hidden values. This definition indicates the most critical problem in big data, which is how to discover values from datasets with an enormous scale, various types, and rapid generation. As Jay Parikh, Deputy Chief Engineer of Facebook, said, "You could only own a bunch of data other than big data if you do not utilize the collected data" [11].

In addition, NIST defines big data as "Big data shall mean the data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis, or the data which may be effectively processed with important horizontal zoom technologies," which focuses on the technological aspect of big data. It indicates that efficient methods or technologies need to be developed and used to analyze and process big data.

There have been considerable discussions from both industry and academia on the definition of big data [16, 17]. In addition to developing a proper definition, the big data research should also focus on how to extract its value, how to use data, and how to transform "a bunch of data" into "big data."

1.3 Big data value

McKinsey & Company observed how big data created values after in-depth research on the US healthcare, the EU public sector administration, the US retail, the global manufacturing, and the global personal location data. Through research on the five core industries that represent the global economy, the McKinsey report pointed out that big data may give a full play to the economic function, improve the productivity and competitiveness of enterprises and public sectors, and create huge benefits for consumers. In [10], McKinsey summarized the values that big data could create: if big data could be creatively and effectively utilized to improve efficiency and quality, the potential value of the US medical industry gained through data may surpass USD 300 billion, thus reducing the expenditure for the US healthcare by over 8 %; retailers that fully utilize big data may improve their profit by more than 60 %; big data may also be utilized to improve the efficiency of government operations, such that the developed economies in Europe could save over EUR 100 billion (which excludes the effect of reduced frauds, errors, and tax difference).


Fig. 2 The 4Vs features of big data

The McKinsey report is regarded as prospective and predictive, while the following facts may validate the values of big data. During the 2009 flu pandemic, Google obtained timely information by analyzing big data, which even provided more valuable information than that provided by disease prevention centers. Nearly all countries required hospitals to inform agencies such as disease prevention centers of new cases of the new type of influenza. However, patients usually did not see doctors immediately when they got infected. It also took some time to send information from hospitals to disease prevention centers, and for disease prevention centers to analyze and summarize such information. Therefore, when the public became aware of the pandemic of the new type of influenza, the disease may have already been spreading for one to two weeks, i.e., with a hysteretic nature. Google found that, during the spreading of influenza, entries frequently sought at its search engines would be different from those at ordinary times, and the use frequencies of the entries were correlated to the influenza spreading in both time and location. Google found 45 search entry groups that were closely relevant to the outbreak of influenza and incorporated them in specific mathematical models to forecast the spreading of influenza, and even to predict places where influenza spread from. The related research results have been published in Nature [18].
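
The idea behind such query-based forecasting can be illustrated with a minimal sketch. The data and the linear model below are hypothetical and are not Google's actual method; they only show how the weekly share of flu-related search queries could be regressed against reported influenza activity and then used to estimate current activity without the reporting lag.

```python
# Hypothetical sketch of query-based epidemic "nowcasting" (not the model of [18]):
# regress reported influenza-like-illness (ILI) rates on the weekly share of
# flu-related search queries, then estimate this week's activity from queries alone.
import numpy as np

query_share = np.array([0.011, 0.014, 0.019, 0.027, 0.031, 0.024, 0.016])  # assumed values
ili_rate    = np.array([1.2,   1.6,   2.3,   3.4,   3.9,   2.9,   1.8])    # assumed values

slope, intercept = np.polyfit(query_share, ili_rate, deg=1)  # least-squares line

this_week_share = 0.029                       # query share observed without any reporting delay
print(slope * this_week_share + intercept)    # estimated current flu activity
```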

In 2008, Microsoft purchased Farecast, a sci-tech venture company in the US. Farecast has an airline ticket forecast system that predicts the trends and rising/dropping ranges of airline ticket prices. The system has been incorporated into the Bing search engine of Microsoft. By 2012, the system had saved nearly USD 50 per ticket per passenger, with the forecast accuracy as high as 75 %.

At present, data has become an important production factor that could be comparable to material assets and human capital. As multimedia, social media, and IoT are developing, enterprises will collect more information, leading to an exponential growth of data volume. Big data will have a huge and increasing potential in creating values for businesses and consumers.

1.4 The development of big data

In the late 1970s, the concept of "database machine" emerged, which is a technology specially used for storing and analyzing data. With the increase of data volume, the storage and processing capacity of a single mainframe computer system became inadequate. In the 1980s, people proposed "share nothing," a parallel database system, to meet the demand of the increasing data volume [19]. The share nothing system architecture is based on the use of clusters, and every machine has its own processor, storage, and disk. The Teradata system was the first successful commercial parallel database system, and such databases became very popular later on. On June 2, 1986, a milestone event occurred, when Teradata delivered the first parallel database system with a storage capacity of 1 TB to Kmart, to help the large-scale retail company in North America expand its data warehouse [20]. In the late 1990s, the advantages of parallel databases were widely recognized in the database field.

However, many new challenges on big data arose. With the development of Internet services, indexes and queried contents were rapidly growing. Therefore, search engine companies had to face the challenges of handling such big data. Google created the GFS [21] and MapReduce [22] programming models to cope with the challenges brought about by data management and analysis at the Internet scale. In addition, contents generated by users, sensors, and other ubiquitous data sources also fueled the overwhelming data flows, which required a fundamental change of the computing architecture and large-scale data processing mechanisms. In January 2007, Jim Gray, a pioneer of database software,


called such a transformation "The Fourth Paradigm" [23]. He also thought the only way to cope with such a paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred: EMC/IDC published a research report titled Extracting Values from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest in big data from both industry and academia.
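
The MapReduce programming model referred to above [22] can be summarized in a few lines. The sketch below is a single-machine simulation of its two phases; a real deployment distributes the map and reduce tasks across a cluster and adds scheduling, shuffling over the network, and fault tolerance.

```python
# Single-machine sketch of the MapReduce model [22]: map emits (key, value)
# pairs, the framework groups values by key, and reduce folds each group.
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run_mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:                      # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)          # shuffle: group values by key
    return [reduce_fn(k, v) for k, v in groups.items()]   # reduce phase

print(run_mapreduce(["big data needs big storage", "data about data"]))
```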

Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, etc., have started their big data projects. Taking IBM as an example, since 2005, IBM has invested USD 16 billion on 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of "data processing" in big data. In 2012, the European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. In the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis into the 48 emerging technologies that deserve most attention.

Many national governments, such as the US, also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the "Big Data Research and Development Plan," which was a second major scientific and technological development initiative after the "Information Highway" initiative in 1993. In July 2012, the "Vigorous ICT Japan" project, issued by Japan's Ministry of Internal Affairs and Communications, indicated that the big data development should be a national strategy and that application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

1.5 Challenges of big data

The sharply increasing data deluge in the big data era brings about huge challenges on data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs only apply to structured data, rather than semi-structured or unstructured data. In addition, RDBMSs increasingly utilize more and more expensive hardware. It is apparent that the traditional RDBMSs could not handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the requirements on infrastructure for big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Such programming frameworks have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy big data analysis systems.

Some literature [26–28] discusses obstacles in the development of big data applications. The key challenges are listed as follows:

– Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

– Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective ways to reduce the indirect cost of the entire system, on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant, and may be filtered and compressed at orders of magnitude (a small sketch illustrating this is given after this list).

– Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with a lot of pressing challenges, one of which is that the current storage systems could not support such massive data. Generally speaking, the values hidden in big data depend on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

– Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed with a lack of scalability and expandability, which could not meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and have started to become mainstream in big data analysis. Even so, there are still some problems of non-relational databases in their performance and particular applications. We shall find a compromising solution between RDBMSs and non-relational databases. For example, some enterprises have utilized a mixed database architecture that integrates the advantages of both types of databases (e.g., Facebook and Taobao). More research is needed on the in-memory database and sample data based on approximate analysis.

– Data confidentiality: most big data service providers or owners at present could not effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, the transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data and ensure its safety.

– Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economic and environmental perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data, while the expandability and accessibility are ensured.

– Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

– Cooperation: analysis of big data is an interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.
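
As a concrete illustration of the redundancy reduction challenge above, the sketch below applies two elementary lossless steps, delta encoding followed by run-length encoding, to a highly redundant stream of sensor readings. It is a toy example on assumed data, not a production codec, but it shows why slowly changing sensor data can be compressed by large factors.

```python
# Toy redundancy reduction for sensor data: delta encoding turns a slowly
# changing signal into mostly zeros; run-length encoding then collapses them.
def delta_encode(samples):
    prev, out = 0, []
    for s in samples:
        out.append(s - prev)
        prev = s
    return out

def run_length_encode(values):
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

readings = [20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 20, 20]  # temperature samples
print(run_length_encode(delta_encode(readings)))             # a few (value, count) pairs
```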

2 Related technologies

In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data centers, and Hadoop.

2.1 Relationship between cloud computing and big data

Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

Even though there are many overlapped technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of databases and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands, while cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, and the two supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous amount of networking sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured features, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion, and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT is hinged upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT


2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern "data" rather than "center." They hold masses of data and organize and manage data according to their core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but, at present, it is also the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to data centers. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers is increasingly expanding, it is also an important issue how to reduce the operational cost for the development of data centers.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only concern itself with hardware facilities, but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering, etc. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop of sequencing cost are transforming bio-science and bio-medicine into data-driven sciences. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ, to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and that the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, huge amounts of data in terms of searching entries, Internet forum posts, chatting records, and microblog messages are generated. Those data are closely related to people's daily life, and have similar features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improve the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its target advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close transmission may rely on sensor networks, and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment is distributedly deployed, which may acquire simple numeric data (e.g., location) or complex multimedia data (e.g., surveillance video). In order to meet the demands of analysis and processing, not only the currently acquired data, but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different in type, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location, and every piece of data has a time stamp. The time and space correlations are important properties of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in this field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, in order to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015, this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, information of about 13 million people has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet." IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data, to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167TB to 665TB.

3.1.4 Data generation from other fields

As scientific applications increase, the scale of datasets gradually expands, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004, the data volume generated per night surpassed 20TB. The last application is related to high-energy physics. At the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generated raw data at 2PB/s and stored about 10TB of processed data per year.

In addition, pervasive sensing and computing, among nature, commercial, Internet, government, and social environments, are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets should be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: as one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at the web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format (a small parsing sketch is given at the end of this subsection). Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habit monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled at different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: at present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, and then downloads web pages, identifies all URLs in the downloaded web pages, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include the traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


– Zero-copy packet capture technology: the so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: at present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and more numerous means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.

In addition to the aforementioned data acquisition methods for the main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
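As referenced in the log-file item above, the following is a minimal, hypothetical sketch of parsing W3C extended log lines in Python. The sample lines and field layout are assumptions chosen only for illustration; real servers declare their field order in a `#Fields:` directive, which the sketch reads rather than hard-coding.

```python
from collections import Counter

# Hypothetical W3C extended log excerpt; the "#Fields:" directive declares the layout.
SAMPLE_LOG = """#Fields: date time c-ip cs-method cs-uri-stem sc-status
2014-01-15 08:30:01 192.0.2.10 GET /index.html 200
2014-01-15 08:30:02 192.0.2.11 GET /product/42 200
2014-01-15 08:30:05 192.0.2.10 GET /missing 404
"""

def parse_w3c(text):
    """Yield one dict per log entry, keyed by the declared field names."""
    fields = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]          # remember the declared field layout
        elif line and not line.startswith("#"):
            yield dict(zip(fields, line.split()))

if __name__ == "__main__":
    entries = list(parse_w3c(SAMPLE_LOG))
    # Simple collection-side statistics: requests per URI and per status code.
    print(Counter(e["cs-uri-stem"] for e in entries))
    print(Counter(e["sc-status"] for e in entries))
```

In practice such parsed records would be batched and shipped to the storage system discussed in Section 4, rather than printed.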

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, backbone networks have been deployed with WDM optical transmission systems with a single channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top-of-rack switches (TOR), and such top-of-rack switches are then connected with 10Gbps aggregation switches. The three-layer topological structure is augmented with one more layer on top of the two-layer structure, constituted by 10Gbps or 100Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches using low-cost multi-mode fiber (MMF) with 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
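As a back-of-the-envelope illustration of the two-layer structure described above, the short sketch below computes the oversubscription ratio of one top-of-rack switch, i.e., how much the servers' aggregate 1Gbps downlink capacity exceeds the 10Gbps uplink capacity toward the aggregation layer. The port counts are assumptions chosen only for illustration, not figures from any specific data center.

```python
def tor_oversubscription(servers_per_rack, downlink_gbps, uplinks, uplink_gbps):
    """Ratio of aggregate server bandwidth to uplink bandwidth at one ToR switch."""
    downstream = servers_per_rack * downlink_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream

if __name__ == "__main__":
    # Assumed rack: 40 servers at 1Gbps each, 2 uplinks at 10Gbps each.
    ratio = tor_oversubscription(40, 1, 2, 10)
    print(f"Oversubscription ratio: {ratio:.1f}:1")   # prints 2.0:1 for this example
```

A ratio above 1:1 means intra-DCN flows may contend for uplink bandwidth, which is one motivation for the optical interconnection proposals mentioned above.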

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have serious requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense, but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems, and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. On the contrary, it includes information or metadata related to actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types; searching and identifying errors; correcting errors; documenting error examples and error types; and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistency, and it is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, including a lot of abnormal data, limited by the physical design and affected by environmental noises. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data, so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well-known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As repeated data deletion proceeds, if a new data block has an identifier that is identical to one already in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block (a minimal sketch of this idea follows below). Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problem, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
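The repeated-data-deletion idea can be sketched in a few lines of Python: split a byte stream into blocks, hash each block to obtain its identifier, and physically store only blocks whose identifiers have not been seen before. This is a minimal illustration under stated assumptions (fixed-size chunking and SHA-256 identifiers); production systems typically use content-defined chunking and explicit collision handling.

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 8):
    """Return (identifier sequence, unique block store) for a byte stream."""
    store = {}        # identifier -> one physical copy of the block
    sequence = []     # identifiers in original order, enough to rebuild the data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        ident = hashlib.sha256(block).hexdigest()
        if ident not in store:           # new block: keep one physical copy
            store[ident] = block
        sequence.append(ident)           # duplicates only add an identifier entry
    return sequence, store

if __name__ == "__main__":
    payload = b"blockAAAblockBBBblockAAAblockAAA"   # illustrative, highly redundant data
    seq, store = deduplicate(payload)
    print(len(seq), "blocks referenced,", len(store), "blocks actually stored")
    assert b"".join(store[i] for i in seq) == payload   # lossless reconstruction
```

For this toy payload, four blocks are referenced but only two are stored, illustrating how deduplication trades a small amount of index metadata for large storage savings.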

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue big storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems emerge to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage is to utilize the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually an auxiliary storage device of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disc arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As more servers are used, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failures. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system were not seriously affected, in terms of satisfying customers' requests for reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three kinds of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as the traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that AP systems also ensure availability, but they only ensure eventual consistency rather than the strong consistency of the previous two kinds of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors are tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at the upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault-tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store the large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

The database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy copy, simple APIs, eventual consistency, and support of large-volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main kinds of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response time than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on Consistent Hashing [86], which has the main advantage that node arrival or departure only affects directly adjacent nodes, while other nodes are unaffected, to divide the load among multiple main storage machines (a small sketch of consistent hashing is given after this group of examples). Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the updating and any other operation, the updating operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM, but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged only a few years ago. Deeply influenced by Amazon Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid the need for backup.
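As referenced in the Dynamo item above, the following is a minimal sketch of consistent hashing for partitioning in a Dynamo-style key-value store: nodes and keys are hashed onto the same ring, and each key is assigned to the first node clockwise from its position, so adding or removing a node only moves keys on the adjacent arc. The virtual-node count and the hash function are assumptions chosen for illustration, not Dynamo's actual parameters.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring at many virtual positions
        # to smooth out the load distribution.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the node responsible for a key: the first node clockwise.
        A Dynamo-style store would also replicate the key to the next N-1
        distinct nodes encountered while walking the ring."""
        idx = bisect.bisect(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]

if __name__ == "__main__":
    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    for key in ["user:42", "cart:7", "session:xyz"]:
        print(key, "->", ring.lookup(key))
```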

– Column-oriented databases: the column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional sequenced mapping with sparse, distributed, and persistent storage. The indexes of the mapping are row key, column key, and timestamp, and every value in the mapping is an uninterpreted byte array (a minimal sketch of this data model is given at the end of this group of examples). Each row key in BigTable is a character string of up to 64KB. Rows are stored in lexicographical order and are continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly effective, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers to distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in the descending order of timestamps, so the latest edition will be read first.

The BigTable API features the creation and deletion of Tablets and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values of BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable includes three main components: the Master server, Tablet servers, and the client library. BigTable only allows one Master server to be deployed, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balance. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, as well as deleted or disabled files, and use them in specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of loaded Tablets. When Tablets are too big, they will be segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a mapping between persistent, sequenced, and unchangeable keys and values as arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions include row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter the amount of columns to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed with Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. The row operations are atomic operations, equipped with row-level locking and transaction processing, which is optional at large scale. Partition and distribution are transparently operated and have space for client hash or fixed key.

HyperTable was developed similar to BigTable to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.
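As referenced in the BigTable item above, the sketch below mimics the (row key, column family:qualifier, timestamp) → value mapping with nested dictionaries, returning the latest version on read. It is purely illustrative of the data model; it is not how BigTable or HBase are implemented internally, and all names are hypothetical.

```python
from collections import defaultdict
import time

class ToyColumnStore:
    """(row key, 'family:qualifier', timestamp) -> value; latest edition read first."""

    def __init__(self):
        # row -> column -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts=None):
        """Write a new edition of a cell; editions are kept per timestamp."""
        self._rows[row][column][ts if ts is not None else time.time()] = value

    def get(self, row, column):
        """Read the newest edition of a cell, or None if it does not exist."""
        versions = self._rows[row][column]
        return versions[max(versions)] if versions else None

if __name__ == "__main__":
    t = ToyColumnStore()
    t.put("com.example/index", "anchor:home", "Welcome", ts=1)
    t.put("com.example/index", "anchor:home", "Welcome v2", ts=2)
    print(t.get("com.example/index", "anchor:home"))   # -> Welcome v2
```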

– Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict modes, there is no need to conduct mode migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. A query in MongoDB is expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays (a short usage sketch is given after this group of examples). To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files in the main nodes that support all the high-level operations conducted in the database. During copying, the slave nodes query all the writing operations since the last synchronization with the master and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes by automatically balancing load and failover.


– SimpleDB: SimpleDB is a distributed database and is a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers, in order to ensure data safety and improve performance. This system does not support automatic partition and thus could not be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency, but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein could not be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through the RESTful HTTP API. If a document needs to be modified, the client must download the entire document to modify it, and then send it back to the database. After a document is rewritten once, the identifier will be updated. CouchDB utilizes optimistic replication to obtain scalability without a sharing mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the copying mechanism. CouchDB supports MVCC with historical Hash records.
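As referenced in the MongoDB item above, the following is a short usage sketch of the document model, assuming a locally reachable MongoDB server and the pymongo driver; the database, collection, and field names are illustrative. Documents are inserted as JSON-like objects, a queryable field is indexed, and documents are retrieved with a JSON-style query, including a condition on an embedded array element.

```python
from pymongo import MongoClient

# Assumes a MongoDB server on localhost:27017 and the pymongo driver installed.
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["orders"]          # database and collection names are illustrative

# Documents are schema-free JSON-like objects; an _id primary key is added automatically.
collection.insert_one({"user": "alice", "items": [{"sku": "A1", "qty": 2}], "total": 19.9})

# Secondary index on a queryable field to speed up the query below.
collection.create_index("user")

# JSON-style query, including a condition on a field of an embedded array element.
for order in collection.find({"user": "alice", "items.sku": "A1"}):
    print(order["_id"], order["total"])
```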

Big data are generally stored on hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application (a minimal word-count sketch is given at the end of this list). The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or workstations through the network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making, which does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance will be built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which enables the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine will build a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after job completion in the batch processing system, the extraction engine will collect results and combine them in a proper structure, which is generally a single file list, in which all results are put in order.

– Pregel: the Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable and user-defined value, and every directed edge related to a source vertex is constituted by the user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Vertexes may be deactivated by suspension: when all vertexes are in an inactive status without any message to transmit, the entire program execution is completed (a small sketch of this superstep model is given at the end of this subsection).

The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and the output of a Pregel program are isomorphic directed graphs.
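As referenced in the MapReduce item above, the canonical word-count example needs only the two user-defined functions. The sketch below emulates the framework's shuffle step locally so the Map and Reduce logic can be run and inspected; a real deployment would of course rely on Hadoop or a similar runtime rather than this toy driver.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: compress the value set for one key into a single count."""
    return word, sum(counts)

def run_mapreduce(inputs):
    # Shuffle: group all intermediate values by key (done by the framework in practice).
    groups = defaultdict(list)
    for doc_id, text in inputs:
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

if __name__ == "__main__":
    docs = [(1, "big data storage"), (2, "big data analysis of big datasets")]
    print(run_mapreduce(docs))   # e.g. {'big': 3, 'data': 2, 'storage': 1, ...}
```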

Inspired by the above programming models, other research has also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
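Before turning to data analysis, the vertex-centric superstep model described for Pregel above can also be illustrated in a few lines: every vertex runs the same user-defined compute function each superstep, exchanges messages along out-edges, and execution halts when all vertexes are inactive and no messages remain. The example below propagates the maximum vertex value through a small graph; it is only a sketch of the model under simplifying assumptions, not of Google's implementation.

```python
def propagate_max(graph, values, max_supersteps=30):
    """Vertex-centric sketch: each vertex keeps the largest value it has seen
    and forwards any improvement to its out-neighbors until no messages remain."""
    # Superstep 0: every vertex announces its value along its out-edges.
    messages = {v: [] for v in graph}
    for vertex, neighbors in graph.items():
        for n in neighbors:
            messages[n].append(values[vertex])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in graph}
        active = False
        for vertex, inbox in messages.items():    # conceptually executed in parallel
            if not inbox:
                continue                          # vertex stays inactive (no messages)
            best = max(inbox)
            if best > values[vertex]:             # improvement found: update and forward
                values[vertex] = best
                for n in graph[vertex]:
                    outbox[n].append(best)
                active = True
        if not active:                            # global halt: all vertexes voted to stop
            break
        messages = outbox
    return values

if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    print(propagate_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1}))
    # -> every vertex reachable from 'b' ends with value 6; 'd' keeps its own value
```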

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, the analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values, providing suggestions, or supporting decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

51 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of data. Data analysis plays a major guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised method that requires no training data.

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, and then using the few factors to reveal most of the information of the original data.



– Correlation Analysis: an analytical method for determining the laws of relations among observed phenomena, such as correlation, correlative dependence, and mutual restriction, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, also called a definitive dependence relationship; (ii) correlation, i.e., undetermined or inexact dependence relations, in which the numerical value of one variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into something simple and regular.

– A/B Testing: also called bucket testing, a technology for determining how to improve target variables by comparing the results of tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics in which randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research (a minimal k-means sketch follows this list).
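
As a concrete illustration of cluster analysis and of k-means from the ICDM top-ten list above, the sketch below is a minimal pure-Python version that partitions 2-D points into k clusters. The data and all function names are made up for the example; a production implementation would add better initialization and vectorized distance computations.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Group 2-D points into k clusters by repeatedly assigning each point
    to its nearest centroid and recomputing the centroids."""
    random.seed(seed)
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = []
        for idx, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append((sum(x for x, _ in cluster) / len(cluster),
                                      sum(y for _, y in cluster) / len(cluster)))
            else:
                new_centroids.append(centroids[idx])   # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                                      # converged
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 11), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)   # two centroids, one near (1.2, 1.2) and one near (8.3, 9.3)
```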

52 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are shown as follows.

– Bloom Filter: a Bloom Filter consists of a series of hash functions. Its principle is to store hash values of the data rather than the data itself, using a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages, namely misrecognition (false positives) and difficulty with deletion (a minimal sketch is given after this list).

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when the data is updated.

– Trie: also called a trie tree, a variant of a hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
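
Returning to the Bloom Filter item above, the sketch below is a minimal pure-Python version. The bit-array size, the number of hash functions, and the use of salted MD5 digests from `hashlib` are choices made here only for clarity, not part of any standard implementation.

```python
import hashlib

class BloomFilter:
    """Store hash values of items instead of the items themselves: a lossy,
    space-efficient membership test with possible false positives but no
    false negatives (and no support for deletion)."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size                  # the bitmap index

    def _positions(self, item):
        # Derive num_hashes positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for url in ["http://a.example", "http://b.example"]:
    bf.add(url)
print(bf.might_contain("http://a.example"))   # True
print(bf.might_contain("http://c.example"))   # False with high probability
```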

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
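
To make the low-level model that these languages abstract over more tangible, the sketch below emulates the MapReduce programming pattern (map, shuffle by key, reduce) in plain single-process Python on a word-count task. It is only a toy illustration, not Hadoop's or Google's actual API; a Pig or Hive script would express the same job in a few declarative lines.

```python
from collections import defaultdict

def map_phase(document_id, text):
    # Emit (word, 1) for every word, mirroring a MapReduce map task.
    for word in text.lower().split():
        yield word, 1

def shuffle(mapped):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Sum the values for one key, mirroring a MapReduce reduce task.
    return word, sum(counts)

documents = {1: "big data big value", 2: "data drives value"}
mapped = [pair for doc_id, text in documents.items()
               for pair in map_phase(doc_id, text)]
result = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(result)   # {'big': 2, 'data': 2, 'value': 2, 'drives': 1}
```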

53 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment: MPI – computing nodes and data storage arranged separately (data should be moved to the computing nodes); MapReduce – computing and data storage arranged at the same node (computing should be close to data); Dryad – computing and data storage arranged at the same node (computing should be close to data)
Resource management / scheduling: MPI – not specified; MapReduce – Workqueue (Google), HOD (Yahoo); Dryad – not clear
Low-level programming: MPI – MPI API; MapReduce – MapReduce API; Dryad – Dryad API
High-level programming: MPI – not specified; MapReduce – Pig, Hive, Jaql, etc.; Dryad – Scope, DryadLINQ
Data storage: MPI – the local file system, NFS, etc.; MapReduce – GFS (Google), HDFS (Hadoop), Amazon S3, etc.; Dryad – NTFS, KFS, Cosmos DFS
Task partitioning: MPI – user manually partitions the tasks; MapReduce – automatic; Dryad – automatic
Communication: MPI – messaging, remote memory access; MapReduce – files (local FS, DFS); Dryad – files, TCP pipes, shared-memory FIFOs
Fault tolerance: MPI – checkpoint; MapReduce – task re-execution; Dryad – task re-execution

531 Real-time vs offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without stringent requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

532 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis, and MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case where the data scale surpasses the memory level but may still be imported into the BI analysis environment. Currently, mainstream BI products come with data analysis plans that support data at levels beyond TB.

– Massive analysis is for the case where the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analyses utilize HDFS of Hadoop to store data and use MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


533 Analysis with different complexity

The time and space complexities of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

54 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" of 798 professionals conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed in C, C++, or Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year," R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was more frequently used than R (which ranked first in 2012). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented via the connection of processes, which include various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME; developers can effortlessly add various nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

61 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, e.g., astrophysics, oceanology, genomics, and environmental research. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

62 Big data analysis fields



621 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

622 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
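
Much of the pipeline described above starts from a simple bag-of-words representation. As one hedged illustration of how unstructured text is turned into features for classification, clustering, or topic models, the sketch below computes TF-IDF weights for a toy corpus in pure Python; the corpus and function names are invented for the example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Weight each term of each document by term frequency times
    inverse document frequency (a standard bag-of-words weighting)."""
    tokenized = [doc.lower().split() for doc in corpus]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))          # count documents containing each term
    n_docs = len(corpus)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights

corpus = ["big data needs new analysis tools",
          "text analysis extracts knowledge from text",
          "big data analysis creates business value"]
for doc_weights in tf_idf(corpus):
    top = max(doc_weights, key=doc_weights.get)
    print(top)   # prints one of the most distinctive terms of each document
```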

623 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 624. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 622, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127].
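
PageRank itself can be sketched in a few lines. The following is a simplified power-iteration version in Python (a fixed damping factor of 0.85 and a uniform redistribution of the rank of dangling pages are assumptions of this toy version), intended only to show the idea, not the production algorithm of any search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / len(pages)
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))   # "C" collects the most link weight
```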

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data is gaining increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


624 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and useful knowledge can be extracted and the semantics understood by analyzing such data. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend other contents with similar features to users. These methods rely largely on content similarity measurements, but most of them are troubled by limited analysis capability and over-specification. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, mixed methods have been introduced, which integrate the advantages of the above two types of methods to improve recommendation quality [133].
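
As a rough illustration of the collaborative-filtering family mentioned above, the sketch below implements user-based filtering with cosine similarity over a tiny, made-up rating matrix. Real recommender systems add rating normalization, implicit feedback, and scalable nearest-neighbour search, none of which is shown here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm

def recommend(ratings, user, top_n=1):
    """Score items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "alice": {"video_a": 5, "video_b": 3},
    "bob":   {"video_a": 4, "video_b": 3, "video_c": 5},
    "carol": {"video_b": 1, "video_c": 2, "video_d": 4},
}
print(recommend(ratings, "alice"))   # ['video_c'], driven mostly by bob's similar taste
```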

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for multimedia event detection using only a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

625 Network data analysis

Network data analysis has evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges


and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in an SNS [140]. Linear algebra computes the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph, in which edges connecting vertexes within the sub-graph feature high density, while edges between two sub-graphs feature much lower density [142].
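
A minimal example of the kind of features used in feature-based link prediction is the common-neighbours (or Jaccard) score: pairs of vertexes sharing many neighbours are more likely to connect in the future. The sketch below computes both scores for non-adjacent pairs on a made-up graph; in practice such scores would feed a trained classifier rather than be used directly.

```python
from itertools import combinations

def neighbour_scores(adjacency):
    """Score every non-adjacent vertex pair by its shared neighbourhood."""
    scores = []
    for u, v in combinations(adjacency, 2):
        if v in adjacency[u]:
            continue                      # already linked, nothing to predict
        common = adjacency[u] & adjacency[v]
        union = adjacency[u] | adjacency[v]
        jaccard = len(common) / len(union) if union else 0.0
        scores.append(((u, v), len(common), jaccard))
    return sorted(scores, key=lambda s: s[2], reverse=True)

graph = {
    "ann":  {"bob", "cara", "dave"},
    "bob":  {"ann", "cara"},
    "cara": {"ann", "bob", "dave"},
    "dave": {"ann", "cara", "eve"},
    "eve":  {"dave"},
}
for pair, common, jaccard in neighbour_scores(graph):
    print(pair, common, round(jaccard, 2))
# ('bob', 'dave') ranks highest: they share ann and cara but are not yet linked.
```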

Many methods for community detection have been proposed and studied, most of which are topology-based, relying on target functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis

is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary frequently and quickly and are constantly updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

626 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just starting, we will only introduce some recent and representative analysis applications in this section.

With the growth of the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such characteristics.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones which analyzes people's paces when they walk and uses the pace information for unlocking a safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

63 Key applications of big data

631 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been rapidly developed. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with the ages, genders, addresses, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is significantly lower than those of other commercial banks.

632 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year, due to timely identification and fixing of water pipes that were running and leaking.

633 Application of online social network-oriented big data

Online SNS is a social structure constituted by social indi-viduals and connections among individuals based on an


information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared space, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions, including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop in the volume of topics;

Fig. 5 Enabling technologies for online social network-oriented big data


2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
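
The first of those four analyses, detecting sharp growth or drop in topic volume, can be approximated with a simple moving-average deviation test. The sketch below is a toy version with invented daily counts; Global Pulse's actual pipeline is naturally far more elaborate.

```python
import statistics

def spike_days(daily_counts, window=7, threshold=3.0):
    """Flag days whose tweet volume deviates strongly from the recent average."""
    alerts = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0   # avoid division by zero
        z = (daily_counts[day] - mean) / stdev
        if abs(z) >= threshold:
            alerts.append((day, daily_counts[day], round(z, 1)))
    return alerts

# Hypothetical daily counts of tweets mentioning "rice price".
counts = [120, 118, 131, 125, 122, 127, 119, 124, 130, 126, 410, 128, 123]
print(spike_days(counts))   # flags day 10, where the volume jumps to 410
```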

Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

634 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing, complex data containing abundant and diverse information values. Big data has unlimited potential for

effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final results into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

Fig. 6 The correlation between Tweets about rice price and food price inflation


imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

635 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. Crowd sensing can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
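
A toy version of the matching step in spatial crowdsourcing is shown below: each location-tagged task is assigned to the nearest currently available worker by great-circle distance. The worker and task coordinates are invented for the example, and real systems also consider incentives, deadlines, and location privacy.

```python
import math

def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def assign_tasks(tasks, workers):
    """Greedily send every spatial task to the closest free worker."""
    free = dict(workers)                      # worker id -> (lat, lon)
    plan = {}
    for task_id, location in tasks.items():
        if not free:
            break
        nearest = min(free, key=lambda w: haversine_km(free[w], location))
        plan[task_id] = nearest
        del free[nearest]                     # one task per worker in this toy model
    return plan

tasks = {"photo_bridge": (30.52, 114.31), "traffic_video": (30.58, 114.27)}
workers = {"w1": (30.50, 114.30), "w2": (30.60, 114.25), "w3": (31.00, 114.00)}
print(assign_tasks(tasks, workers))   # {'photo_bridge': 'w1', 'traffic_video': 'w2'}
```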

636 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical loads or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at a given moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price


according to the peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a simple sketch of peak/off-peak labeling from 15-minute meter readings is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can thus complement the traditional hydropower and thermal power generation.
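
As a small illustration of how 15-minute smart-meter readings can feed time-sharing pricing, the sketch below aggregates hypothetical readings into hourly load and labels peak and off-peak hours by a simple threshold on the daily average; actual tariff design by utilities such as TXU Energy is naturally much more involved.

```python
def hourly_load(readings):
    """Sum 15-minute kWh readings for one day (a list of 96 values) into 24 hourly totals."""
    return [sum(readings[h * 4:(h + 1) * 4]) for h in range(24)]

def label_peaks(hourly, peak_ratio=1.2):
    """Mark hours whose load exceeds peak_ratio times the daily average."""
    average = sum(hourly) / len(hourly)
    return [(hour, load, "peak" if load > peak_ratio * average else "off-peak")
            for hour, load in enumerate(hourly)]

# Hypothetical readings: a flat 0.3 kWh per 15 minutes, with an evening peak.
readings = [0.3] * 96
for slot in range(18 * 4, 22 * 4):     # between 18:00 and 22:00 consumption doubles
    readings[slot] = 0.6

for hour, load, label in label_peaks(hourly_load(readings)):
    if label == "peak":
        print(hour, round(load, 1), label)   # hours 18-21 are labelled peak
```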

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. First, we introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

71 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

711 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there are many important problems that remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after a system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes. These include the in-memory mode, data flow mode, PRAM mode, and MR (MapReduce) mode, among others. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach, and data transfer has become a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
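The toy Python example below illustrates the MR mode in miniature: a computation expressed as a map step, a shuffle, and a reduce step. It runs in a single process and is not the Hadoop API; it only shows the programming pattern.

# Toy illustration of the MR computing mode: the computation is expressed as a
# map phase and a reduce phase so that it can follow the data rather than
# moving all data to one place. Single-process sketch, not Hadoop.
from collections import defaultdict

def map_phase(doc_id, text):
    for word in text.lower().split():
        yield word, 1                    # emit intermediate key/value pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)        # group intermediate values by key
    return groups

def reduce_phase(key, values):
    return key, sum(values)              # aggregate per key

docs = {1: "big data needs new tools", 2: "big data drives new values"}
pairs = [kv for d, t in docs.items() for kv in map_phase(d, t)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["big"], counts["new"])      # -> 2 2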

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data. Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.
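As a small illustration, the Python sketch below converts readings arriving as CSV text and as JSON documents, with different field names, into one common record layout; the schemas are invented for the example.

# Sketch of format conversion across heterogeneous sources: CSV rows and JSON
# documents with different field names are normalized into one record layout.
import csv, json, io

def from_csv(text):
    for row in csv.DictReader(io.StringIO(text)):
        yield {"device": row["sensor_id"], "ts": int(row["time"]),
               "value": float(row["reading"])}

def from_json(text):
    for obj in json.loads(text):
        yield {"device": obj["deviceId"], "ts": int(obj["timestamp"]),
               "value": float(obj["measurement"])}

csv_src = "sensor_id,time,reading\nA1,1700000000,21.5\nA2,1700000060,22.1\n"
json_src = '[{"deviceId": "B7", "timestamp": 1700000030, "measurement": 19.8}]'

unified = list(from_csv(csv_src)) + list(from_json(json_src))
print(unified[0], unified[-1])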


– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which are the bottleneck of big data computing. However, data transfer is inevitable in big data applications, so improving the transfer efficiency of big data is a key factor in improving big data computing.
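One common way to trade CPU for network, sketched below with Python's standard gzip module, is to compress a highly redundant payload before moving it; the payload and the resulting ratio are illustrative, and real gains depend entirely on how redundant the data is.

# Sketch: compress a (highly redundant) log payload before transfer.
import gzip

payload = ("2014-01-01T00:00:00 host-12 GET /index.html 200\n" * 50_000).encode()
compressed = gzip.compress(payload, compresslevel=6)

print(len(payload), "bytes raw")
print(len(compressed), "bytes to transfer")
print(f"ratio: {len(payload) / len(compressed):.1f}x")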

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.
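One possible way to make the "rate of depreciation" concrete, under assumptions chosen only for illustration, is an exponential decay of data value with age, with a freshness threshold defining when a record leaves the real-time tier; the half-life and threshold below are arbitrary.

# Sketch of a data life-cycle rule: value decays exponentially with age, and a
# record leaves the "real-time" tier once its value falls below a threshold.
import math

def value_at(age_seconds, initial_value=1.0, half_life=3600):
    return initial_value * math.exp(-math.log(2) * age_seconds / half_life)

def still_fresh(age_seconds, threshold=0.25, **kw):
    return value_at(age_seconds, **kw) >= threshold

print(round(value_at(1800), 3))    # 30 minutes old -> ~0.707 of original value
print(still_fresh(3 * 3600))       # 3 hours old -> False for a 1-hour half-life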

– Processing of big data. As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined from them; and (iii) data exhaust, i.e., wrong or discarded data produced during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems still need to be solved.

– Big data management. The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data. As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate provenance information that follows different standards and comes from different datasets.
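A minimal sketch of one possible approach, assuming invented field and dataset names, is to attach a provenance record to every derived record during integration so that values can later be traced back to their sources:

# Sketch: carrying provenance through an integration step. Each output record
# remembers which source datasets it came from and which transformation
# produced it.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Provenance:
    sources: list
    transformation: str
    produced_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def integrate(customer, orders):
    merged = {"customer_id": customer["id"],
              "total_spent": sum(o["amount"] for o in orders)}
    merged["_provenance"] = Provenance(
        sources=["crm.customers", "shop.orders"],
        transformation="join-on-customer-id + sum(amount)")
    return merged

rec = integrate({"id": 42}, [{"amount": 19.9}, {"amount": 5.0}])
print(rec["total_spent"], rec["_provenance"].sources)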

– Big data application. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows rapidly, safety risks become more severe, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy. Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition, since personal interests, habits, body properties, etc. of users may be acquired more easily, possibly without the users being aware; and (ii) leakage of personal privacy data during storage, transmission, and usage, even if the data was acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data at present. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings, packaged such data into a 2.8 GB archive, and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization; low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.
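A toy sketch of such automated checks, with one simple test per dimension named above and illustrative thresholds and fields, might look as follows in Python:

# Sketch of automated data-quality checks over a batch of records: a crude
# accuracy (range) test plus completeness, redundancy, and consistency checks.
records = [
    {"id": 1, "temp_c": 21.4, "city": "Wuhan"},
    {"id": 2, "temp_c": None, "city": "Wuhan"},
    {"id": 2, "temp_c": None, "city": "Wuhan"},       # duplicate of id 2
    {"id": 3, "temp_c": 999.0, "city": "Auburn"},     # implausible reading
]

complete = sum(r["temp_c"] is not None for r in records) / len(records)
unique_ids = len({r["id"] for r in records})
redundancy = 1 - unique_ids / len(records)
accurate = sum(r["temp_c"] is not None and -60 <= r["temp_c"] <= 60
               for r in records) / len(records)
consistent = all(isinstance(r["id"], int) for r in records)

print(f"completeness={complete:.2f} redundancy={redundancy:.2f} "
      f"accuracy={accurate:.2f} consistent={consistent}")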

– Big data safety mechanism. Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be ensured on the premise of efficiency assurance.
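As a small illustration of the performance concern only, the sketch below encrypts a large byte stream chunk by chunk with the third-party cryptography package (pip install cryptography), so that pieces can be encrypted, transferred, and decrypted independently; it is not a complete big data safety mechanism, and key management is glossed over.

# Sketch: chunk-wise encryption of a large byte stream so that chunks can be
# processed independently (and in parallel). Not a full safety mechanism.
from cryptography.fernet import Fernet

CHUNK = 1 << 20                          # 1 MiB per chunk
key = Fernet.generate_key()              # in practice: from a key-management service
cipher = Fernet(key)

data = b"x" * (3 * CHUNK + 1234)         # stand-in for a large dataset
encrypted_chunks = [cipher.encrypt(data[i:i + CHUNK])
                    for i in range(0, len(data), CHUNK)]
restored = b"".join(cipher.decrypt(c) for c in encrypted_chunks)

assert restored == data
print(len(encrypted_chunks), "chunks encrypted")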

– Big data application in information security. Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be identified more easily through the analysis of big data.
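A toy example of mining security value from logs: the Python sketch below counts failed logins per source IP in an SSH-style auth log and flags outliers. The log lines, the regular expression, and the threshold are invented for illustration; real APT detection is far harder.

# Toy example: flag source IPs with many failed logins in an auth log.
import re
from collections import Counter

log_lines = [
    "Jan 12 03:14:07 sshd[411]: Failed password for root from 203.0.113.9",
    "Jan 12 03:14:09 sshd[411]: Failed password for root from 203.0.113.9",
    "Jan 12 03:14:11 sshd[411]: Failed password for admin from 203.0.113.9",
    "Jan 12 08:30:02 sshd[902]: Accepted password for alice from 198.51.100.7",
    "Jan 12 09:02:44 sshd[917]: Failed password for bob from 198.51.100.7",
]

failed = Counter(
    m.group(1)
    for line in log_lines
    if (m := re.search(r"Failed password .* from (\d+\.\d+\.\d+\.\d+)", line)))

THRESHOLD = 3                            # arbitrary alert threshold
suspects = [ip for ip, n in failed.items() if n >= THRESHOLD]
print(failed)                            # per-IP failed-login counts
print(suspects)                          # -> ['203.0.113.9']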

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, and completeness maintenance, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data; the theoretical basis of Hadoop emerged as early as 2006. Many researchers are concerned with better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Spanner, Google's globally-distributed database, and F1, a fault-tolerant and scalable distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making, but analytical results can only be effectively utilized by users when they are displayed in a friendly manner. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.
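A minimal charting example, assuming matplotlib and numpy are installed and using synthetic data, produces two of the report-style visuals mentioned above (a histogram and a fitted regression curve):

# Minimal example of report-style visuals: a histogram and a least-squares
# regression line over synthetic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 200)
y = 2.5 * x + rng.normal(0, 3, 200)           # noisy linear relation

slope, intercept = np.polyfit(x, y, 1)        # least-squares regression line

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.hist(y, bins=20)
ax1.set_title("Distribution of y")
ax2.scatter(x, y, s=8, alpha=0.5)
xs = np.linspace(0, 10, 100)
ax2.plot(xs, slope * xs + intercept, color="red")
ax2.set_title(f"Regression: y = {slope:.2f}x + {intercept:.2f}")
fig.tight_layout()
fig.savefig("analysis_report.png")            # or plt.show() interactively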

– Data-oriented. It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– Simple algorithms applied to big data are more effective than complex algorithms applied to small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, the analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold. httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011). httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11 Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (last accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


Page 3: Big Data: A Survey Min Chen

At present although the importance of big data has beengenerally recognized people still have different opinions onits definition In general big data shall mean the datasetsthat could not be perceived acquired managed and pro-cessed by traditional IT and softwarehardware tools withina tolerable time Because of different concerns scientificand technological enterprises research scholars data ana-lysts and technical practitioners have different definitionsof big data The following definitions may help us have abetter understanding on the profound social economic andtechnological connotations of big data

In 2010 Apache Hadoop defined big data as ldquodatasetswhich could not be captured managed and processed bygeneral computers within an acceptable scoperdquo On the basisof this definition in May 2011 McKinsey amp Company aglobal consulting agency announced Big Data as the nextfrontier for innovation competition and productivity Bigdata shall mean such datasets which could not be acquiredstored and managed by classic database software This def-inition includes two connotations First datasetsrsquo volumesthat conform to the standard of big data are changing andmay grow over time or with technological advances Sec-ond datasetsrsquo volumes that conform to the standard of bigdata in different applications differ from each other Atpresent big data generally ranges from several TB to sev-eral PB [10] From the definition by McKinsey amp Companyit can be seen that the volume of a dataset is not the onlycriterion for big data The increasingly growing data scaleand its management that could not be handled by traditionaldatabase technologies are the next two key features

As a matter of fact big data has been defined as earlyas 2001 Doug Laney an analyst of META (presentlyGartner) defined challenges and opportunities brought aboutby increased data with a 3Vs model ie the increase ofVolume Velocity and Variety in a research report [12]Although such a model was not originally used to definebig data Gartner and many other enterprises includingIBM [13] and some research departments of Microsoft [14]still used the ldquo3Vsrdquo model to describe big data withinthe following ten years [15] In the ldquo3Vsrdquo model Volumemeans with the generation and collection of masses ofdata data scale becomes increasingly big Velocity meansthe timeliness of big data specifically data collection andanalysis etc must be rapidly and timely conducted so asto maximumly utilize the commercial value of big dataVariety indicates the various types of data which includesemi-structured and unstructured data such as audio videowebpage and text as well as traditional structured data

However others have different opinions including IDCone of the most influential leaders in big data and itsresearch fields In 2011 an IDC report defined big data asldquobig data technologies describe a new generation of tech-nologies and architectures designed to economically extract

value from very large volumes of a wide variety of data byenabling the high-velocity capture discovery andor anal-ysisrdquo [1] With this definition characteristics of big datamay be summarized as four Vs ie Volume (great volume)Variety (various modalities) Velocity (rapid generation)and Value (huge value but very low density) as shown inFig 2 Such 4Vs definition was widely recognized sinceit highlights the meaning and necessity of big data ieexploring the huge hidden values This definition indicatesthe most critical problem in big data which is how to dis-cover values from datasets with an enormous scale varioustypes and rapid generation As Jay Parikh Deputy ChiefEngineer of Facebook said ldquoYou could only own a bunchof data other than big data if you do not utilize the collecteddatardquo [11]

In addition NIST defines big data as ldquoBig data shallmean the data of which the data volume acquisition speedor data representation limits the capacity of using traditionalrelational methods to conduct effective analysis or the datawhich may be effectively processed with important horizon-tal zoom technologiesrdquo which focuses on the technologicalaspect of big data It indicates that efficient methods ortechnologies need to be developed and used to analyze andprocess big data

There have been considerable discussions from bothindustry and academia on the definition of big data [16 17]In addition to developing a proper definition the big dataresearch should also focus on how to extract its value howto use data and how to transform ldquoa bunch of datardquo into ldquobigdatardquo

13 Big data value

McKinsey amp Company observed how big data created val-ues after in-depth research on the US healthcare the EUpublic sector administration the US retail the global man-ufacturing and the global personal location data Throughresearch on the five core industries that represent the globaleconomy the McKinsey report pointed out that big datamay give a full play to the economic function improve theproductivity and competitiveness of enterprises and publicsectors and create huge benefits for consumers In [10]McKinsey summarized the values that big data could cre-ate if big data could be creatively and effectively utilizedto improve efficiency and quality the potential value ofthe US medical industry gained through data may surpassUSD 300 billion thus reducing the expenditure for the UShealthcare by over 8 retailers that fully utilize big datamay improve their profit by more than 60 big data mayalso be utilized to improve the efficiency of governmentoperations such that the developed economies in Europecould save over EUR 100 billion (which excludes the effectof reduced frauds errors and tax difference)

Mobile Netw Appl (2014) 19171ndash209 173

Fig 2 The 4Vs feature of big data

The McKinsey report is regarded as prospective andpredictive while the following facts may validate the val-ues of big data During the 2009 flu pandemic Googleobtained timely information by analyzing big data whicheven provided more valuable information than that providedby disease prevention centers Nearly all countries requiredhospitals inform agencies such as disease prevention centersof the new type of influenza cases However patients usu-ally did not see doctors immediately when they got infectedIt also took some time to send information from hospitals todisease prevention centers and for disease prevention cen-ters to analyze and summarize such information Thereforewhen the public is aware of the pandemic of the new typeof influenza the disease may have already spread for one totwo weeks with a hysteretic nature Google found that dur-ing the spreading of influenza entries frequently sought atits search engines would be different from those at ordinarytimes and the use frequencies of the entries were corre-lated to the influenza spreading in both time and locationGoogle found 45 search entry groups that were closely rel-evant to the outbreak of influenza and incorporated themin specific mathematic models to forecast the spreading ofinfluenza and even to predict places where influenza spreadfrom The related research results have been published inNature [18]

In 2008 Microsoft purchased Farecast a sci-tech venturecompany in the US Farecast has an airline ticket forecastsystem that predicts the trends and risingdropping ranges ofairline ticket price The system has been incorporated intothe Bing search engine of Microsoft By 2012 the systemhas saved nearly USD 50 per ticket per passenger with theforecasted accuracy as high as 75

At present data has become an important production fac-tor that could be comparable to material assets and humancapital As multimedia social media and IoT are devel-oping enterprises will collect more information leading

to an exponential growth of data volume Big data willhave a huge and increasing potential in creating values forbusinesses and consumers

14 The development of big data

In the late 1970s the concept of ldquodatabase machinerdquoemerged which is a technology specially used for stor-ing and analyzing data With the increase of data volumethe storage and processing capacity of a single mainframecomputer system became inadequate In the 1980s peo-ple proposed ldquoshare nothingrdquo a parallel database system tomeet the demand of the increasing data volume [19] Theshare nothing system architecture is based on the use ofcluster and every machine has its own processor storageand disk Teradata system was the first successful com-mercial parallel database system Such database becamevery popular lately On June 2 1986 a milestone eventoccurred when Teradata delivered the first parallel databasesystem with the storage capacity of 1TB to Kmart to helpthe large-scale retail company in North America to expandits data warehouse [20] In the late 1990s the advantagesof parallel database was widely recognized in the databasefield

However many challenges on big data arose With thedevelopment of Internet servies indexes and queried con-tents were rapidly growing Therefore search engine com-panies had to face the challenges of handling such big dataGoogle created GFS [21] and MapReduce [22] program-ming models to cope with the challenges brought aboutby data management and analysis at the Internet scale Inaddition contents generated by users sensors and otherubiquitous data sources also feuled the overwhelming dataflows which required a fundamental change on the comput-ing architecture and large-scale data processing mechanismIn January 2007 Jim Gray a pioneer of database software

174 Mobile Netw Appl (2014) 19171ndash209

called such transformation ldquoThe Fourth Paradigmrdquo [23] Healso thought the only way to cope with such paradigm wasto develop a new generation of computing tools to managevisualize and analyze massive data In June 2011 anothermilestone event occurred EMCIDC published a researchreport titled Extracting Values from Chaos [1] which intro-duced the concept and potential of big data for the firsttime This research report triggered the great interest in bothindustry and academia on big data

Over the past few years nearly all major companiesincluding EMC Oracle IBM Microsoft Google Ama-zon and Facebook etc have started their big data projectsTaking IBM as an example since 2005 IBM has investedUSD 16 billion on 30 acquisitions related to big data Inacademia big data was also under the spotlight In 2008Nature published a big data special issue In 2011 Sciencealso launched a special issue on the key technologies ofldquodata processingrdquo in big data In 2012 European ResearchConsortium for Informatics and Mathematics (ERCIM)News published a special issue on big data In the beginningof 2012 a report titled Big Data Big Impact presented at theDavos Forum in Switzerland announced that big data hasbecome a new kind of economic assets just like currencyor gold Gartner an international research agency issuedHype Cycles from 2012 to 2013 which classified big datacomputing social analysis and stored data analysis into 48emerging technologies that deserve most attention

Many national governments such as the US also paidgreat attention to big data In March 2012 the ObamaAdministration announced a USD 200 million investmentto launch the ldquoBig Data Research and Development Planrdquowhich was a second major scientific and technologicaldevelopment initiative after the ldquoInformation Highwayrdquo ini-tiative in 1993 In July 2012 the ldquoVigorous ICT Japanrdquoproject issued by Japanrsquos Ministry of Internal Affairs andCommunications indicated that the big data developmentshould be a national strategy and application technologiesshould be the focus In July 2012 the United Nations issuedBig Data for Development report which summarized howgovernments utilized big data to better serve and protecttheir people

15 Challenges of big data

The sharply increasing data deluge in the big data erabrings about huge challenges on data acquisition storagemanagement and analysis Traditional data managementand analysis systems are based on the relational databasemanagement system (RDBMS) However such RDBMSsonly apply to structured data other than semi-structured orunstructured data In addition RDBMSs are increasinglyutilizing more and more expensive hardware It is appar-ently that the traditional RDBMSs could not handle the

huge volume and heterogeneity of big data The researchcommunity has proposed some solutions from different per-spectives For example cloud computing is utilized to meetthe requirements on infrastructure for big data eg costefficiency elasticity and smooth upgradingdowngradingFor solutions of permanent storage and management oflarge-scale disordered datasets distributed file systems [24]and NoSQL [25] databases are good choices Such program-ming frameworks have achieved great success in processingclustered tasks especially for webpage ranking Various bigdata applications can be developed based on these innova-tive technologies or platforms Moreover it is non-trivial todeploy the big data analysis systems

Some literature [26ndash28] discuss obstacles in the develop-ment of big data applications The key challenges are listedas follows

ndash Data representation many datasets have certain levelsof heterogeneity in type structure semantics organiza-tion granularity and accessibility Data representationaims to make data more meaningful for computer anal-ysis and user interpretation Nevertheless an improperdata representation will reduce the value of the origi-nal data and may even obstruct effective data analysisEfficient data representation shall reflect data structureclass and type as well as integrated technologies so asto enable efficient operations on different datasets

– Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective ways to reduce the indirect cost of the entire system, on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant, and may be filtered and compressed by orders of magnitude.

– Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with many pressing challenges, one of which is that current storage systems cannot support such massive data. Generally speaking, the value hidden in big data depends on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

– Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed with a lack of scalability and expandability, and cannot meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and have started to become mainstream in big data analysis. Even so, non-relational databases still face problems in performance and in particular applications. We shall find a compromise between RDBMSs and non-relational databases; for example, some enterprises, such as Facebook and Taobao, have utilized a mixed database architecture that integrates the advantages of both types of databases. More research is needed on the in-memory database and on sample data based on approximate analysis.

– Data confidentiality: most big data service providers or owners at present cannot effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, a transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information, such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data and ensure its safety.

– Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economic and environmental perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data while expandability and accessibility are ensured.

– Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

– Cooperation: analysis of big data is an interdisciplinary research effort, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.

2 Related technologies

In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data center, and Hadoop.

2.1 Relationship between cloud computing and big data

Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of computation-intensive operations and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity of cloud computing can improve the efficiency of acquiring and analyzing big data.

Even though there are many overlapping technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO), focusing on business operations. Since decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of databases and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands, and cloud computing developed from virtualization technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data; the two supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured feature, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion, and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT is hinged upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data has already lagged behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT


2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. A data center mainly concerns "data" rather than "center": it holds masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but at present it is also the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and back up data effectively. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to the data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers is increasingly expanding, how to reduce the operational cost of data centers is also an important issue.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only concern itself with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, its biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that its Hadoop cluster could process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine into data-driven sciences. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ, to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.
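To make the MapReduce programming model referred to above more concrete, the following minimal, single-machine sketch emulates the classic word-count pattern in Python. It is an illustration of the model only: the function names and the two sample input splits are invented for this example, and a real Hadoop job would instead implement Mapper and Reducer classes (or use Hadoop Streaming) and read its splits from HDFS.

```python
from collections import defaultdict

def map_phase(split):
    # Map step: emit (word, 1) for every word in one input split.
    for word in split.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle step: group intermediate values by key, as the Hadoop
    # framework does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce step: aggregate all counts observed for one key.
    return key, sum(values)

if __name__ == "__main__":
    splits = ["big data needs hadoop", "hadoop processes big data"]
    intermediate = [pair for s in splits for pair in map_phase(s)]
    for key, values in sorted(shuffle(intermediate).items()):
        print(reduce_phase(key, values))
```

The value of the model is that the map and reduce functions contain no coordination logic at all; the framework is free to run them on thousands of nodes and to re-execute failed tasks.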

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of searching entries, Internet forum posts, chatting records, and microblog messages is generated. Such data is closely related to people's daily life and shares the features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises is the main source of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which is historically static data managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improve the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], and that the business turnover through the Internet, business-to-business and business-to-consumer, will reach USD 450 billion per day [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer transactions per hour, and such trading data is imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and households, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close-range transmission may rely on sensor networks and remote transmission depends on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition devices are distributedly deployed, which may acquire simple numeric data (e.g., location) or complex multimedia data (e.g., surveillance video). In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT is characterized by large scale.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location, and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture violations of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies have been developed since the beginning of the 21st century, the frontier research in the bio-medicine field has also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in this field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to be combined with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making the big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rises quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet." IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167 TB to 665 TB.

3.1.4 Data generation from other fields

As scientific applications increase, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although they are in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the US National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night surpassed 20 TB. The last application is related to high-energy physics. In the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generated raw data at 2 PB/s and stored about 10 TB of processed data per year.

In addition, pervasive sensing and computing across natural, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have unique data characteristics in scale, time dimension, and data category. For example, mobile data has been recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases the storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: as one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format (a sketch of parsing such a log line is given after this list). Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled at different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: at present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. A web crawler acquires a URL in order of precedence through a URL queue, downloads the web page, identifies all URLs in the downloaded web page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped (a minimal crawler following this procedure is sketched after this list). Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include the traditional Libpcap-based packet capture technology and zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


– Zero-copy packet capture technology: the so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at the external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: at present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy": it may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
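As a small illustration of the log-file formats mentioned above, the sketch below parses one line in the NCSA common log format (the first of the three formats listed). The regular expression and field names are our own illustrative choices, not part of any particular web server's tooling.

```python
import re

# One line in the NCSA common log format looks like:
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    # Return a dictionary of named fields, or None for a malformed line.
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /index.html HTTP/1.0" 200 2326')
    print(parse_line(sample))
```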
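The crawling procedure described above (maintain a URL queue, download a page, extract its links, and enqueue the unseen ones) can likewise be sketched in a few lines of Python using only the standard library. This is a teaching sketch under simplifying assumptions: the seed URL and page limit are placeholders, and a production crawler would additionally respect robots.txt, throttle requests, and persist its queue and page store.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs
                              if name == "href" and value)

def crawl(seed_url, max_pages=10):
    # Breadth-first crawl: pop a URL from the queue, download the page,
    # extract new URLs, and append the ones not seen before.
    queue, seen, pages = deque([seed_url]), {seed_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue  # skip pages that cannot be fetched
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Example (the seed URL is a placeholder):
# pages = crawl("https://example.com", max_pages=5)
```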

In addition to the aforementioned data acquisition methods for main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources, and collection methods recording through other auxiliary tools.

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: inter-DCN transmissions are from the data source to the data center and are generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them onto the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single-channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or TB/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TOR), and such top-of-rack switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such a layer is constituted by 10 Gbps or 100 Gbps core switches to connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches using low-cost multi-mode fiber (MMF) with 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copying, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; on the contrary, it includes information or metadata related to the actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types; searching and identifying errors; correcting errors; documenting error examples and error types; and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance to keep data consistency, and it is widely applied in many fields, such as banking, insurance, retail, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noises. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors in input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data, so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

On generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one already in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block (this procedure is sketched below). Repeated data deletion can greatly reduce the storage requirement, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores these feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features of the problem, the performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
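A minimal sketch of the repeated-data-deletion idea described above is given here, assuming the data has already been divided into blocks and using SHA-256 as the hash that produces block identifiers; the function and variable names are illustrative only, and real deduplicating storage systems add content-defined chunking, collision handling, and an on-disk index.

```python
import hashlib

def deduplicate(blocks):
    # Assign every data block an identifier via a hash function. A block whose
    # identifier is already on the identification list is deemed redundant and
    # replaced by a reference to the stored copy.
    store = {}    # identifier -> unique stored block
    layout = []   # sequence of identifiers reconstructing the original stream
    for block in blocks:
        identifier = hashlib.sha256(block).hexdigest()
        if identifier not in store:
            store[identifier] = block
        layout.append(identifier)
    return store, layout

if __name__ == "__main__":
    data = [b"sensor-reading-A", b"sensor-reading-B", b"sensor-reading-A"]
    store, layout = deduplicate(data)
    print(f"{len(layout)} logical blocks stored as {len(store)} unique blocks")
```

The saving grows with the redundancy of the input: highly repetitive sensor streams shrink dramatically, while already-compressed data gains little, which is why the cost and benefit of redundancy reduction must be balanced as noted above.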

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as the auxiliary equipment of a server, a data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue big storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually an auxiliary storage device of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disc arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates over multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected in terms of satisfying customers' requests for reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP [80, 81] theory in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, so that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that AP systems also ensure availability. AP systems only ensure eventual consistency rather than the strong consistency of the previous two types of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Service (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors are tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source code of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support of large-volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on Consistent Hashing [86] (sketched after this overview of key-value stores), which has the main advantage that node passing only affects directly adjacent nodes and does not affect other nodes, to divide the load among multiple main storage machines. Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the updating and any other operation, the updating operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. Notably, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged only a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys onto nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid the need for backup.
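Since the partition schemes of Dynamo and several of the stores above rest on consistent hashing, a minimal sketch of the idea is given here. Nodes and keys are hashed onto the same ring, and a key is served by the first node encountered clockwise from its position, so adding or removing a node only remaps keys in the directly adjacent range. The node names, the use of MD5, and the number of virtual nodes are illustrative assumptions; Dynamo itself further replicates each key to the next N distinct nodes on the ring.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes=(), virtual_nodes=3):
        self.virtual_nodes = virtual_nodes   # ring positions per physical node
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(value):
        # Map an arbitrary string onto the hash ring.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Insert several virtual positions so the load spreads evenly;
        # only keys between the new positions and their predecessors move.
        for i in range(self.virtual_nodes):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def get_node(self, key):
        # The key is served by the first node clockwise from its position.
        if not self._ring:
            return None
        index = bisect.bisect(self._ring, (self._position(key), ""))
        return self._ring[index % len(self._ring)][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get_node("user:42"))   # -> the server responsible for this key
```

Removing a node is symmetric: deleting its virtual positions hands only the keys it owned to the next node clockwise, which is exactly the property that lets Dynamo-style stores grow and shrink incrementally.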

– Column-oriented Databases: Column-oriented databases store and process data by columns rather than by rows. Both columns and rows are segmented across multiple nodes to achieve expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., the units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of the machines. Columns are grouped according to the prefixes of their keys, thus forming column families, which are the basic units of access control. Timestamps are 64-bit integers used to distinguish different versions of a cell value. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamp, so the latest version is always read first.

The BigTable API supports the creation and deletion of Tablets and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from individual columns, or browse sub-datasets in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: a Master server, Tablet servers, and a client library. BigTable allows only one Master server to be active, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing the load. In addition, it can modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, i.e., files that have been deleted or disabled by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for reading and writing the loaded Tablets. When Tablets grow too big, they are split by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling of machine failures, and monitoring of machine status. The SSTable file format is used to store BigTable data internally; it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing access control lists.

– Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. A table in Cassandra is a distributed four-dimensional structured map, where the four dimensions are row, column family, column, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may be grouped into clusters called column families, similar to the BigTable data model. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns with related names. A column family includes columns and super columns, which may be continuously inserted into the column family at runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: Since the BigTable code is not available under an open source license, some open source projects, such as HBase and Hypertable, compete to implement the BigTable concept and develop similar systems.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly flushes them into files on disk. Row operations are atomic, with row-level locking and transaction processing, which is optional at large scale. Partitioning and distribution are operated transparently and have space for client hashing or fixed keys.

HyperTable was developed in the image of BigTable to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called the HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

Since column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanisms and several other features. For example, Cassandra emphasizes weak consistency of concurrent control over multiple versions, while HBase and HyperTable focus on strong consistency through locks or log records.

– Document Databases: Compared with key-value storage, document storage supports more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON, and a database driver sends the query as a BSON object to MongoDB. The system allows queries over all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents (a brief sketch of this usage is given after the CouchDB item below). Replication in MongoDB is executed with log files on the master node that record all the high-level operations conducted in the database. During replication, the slaves query the master for all the write operations since their last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, automatically balancing load and handling failover.


– SimpleDB: SimpleDB is a distributed database offered as a web service by Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name-value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be scaled as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB assures eventual consistency but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
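
As a small illustration of the document model and JSON-like query syntax described for MongoDB above, the following sketch uses the pymongo driver. The connection string, database, collection, and field names are hypothetical, and a locally running MongoDB instance is assumed.

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")    # assumed local MongoDB instance
    db = client["shop"]                                   # hypothetical database name

    # Documents are free-form dicts; the driver converts them to BSON objects.
    db.orders.insert_one({"user": "alice",
                          "items": [{"sku": "a-1", "qty": 2}],
                          "total": 31.5})

    # Index a queryable field to speed up lookups, then query with a JSON-like filter.
    db.orders.create_index([("user", ASCENDING)])
    for doc in db.orders.find({"total": {"$gt": 10}}):
        print(doc["_id"], doc["user"], doc["total"])

The same dictionary-style filter documents also drive updates and deletions, which is part of what makes schema-free document stores convenient for fast-changing data.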

Big data is generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL systems and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing over large clusters of commercial PCs, achieving automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then groups all the intermediate values related to the same key and passes them to the Reduce function, which further compresses the value set into a smaller set (a minimal word-count sketch is given at the end of this item). MapReduce has the advantage that it hides the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication: the user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, but this has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the high-level declarative language SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all common operations; programmers therefore have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some high-level language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
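
To make the two-function model concrete, here is a minimal, single-process Python sketch of the classic word-count job. It only mimics the Map, shuffle, and Reduce stages in memory and is not tied to Hadoop or any other MapReduce implementation; the input lines are made up.

    from collections import defaultdict

    def map_fn(_, line):
        # Map: emit an intermediate (word, 1) pair for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        # Reduce: compress the value set for one key into a single count.
        yield word, sum(counts)

    def run_job(lines):
        shuffle = defaultdict(list)
        for i, line in enumerate(lines):          # the framework would distribute this loop
            for key, value in map_fn(i, line):
                shuffle[key].append(value)        # group intermediate values by key
        return dict(kv for k, vs in shuffle.items() for kv in reduce_fn(k, vs))

    print(run_job(["big data is big", "data analysis"]))
    # e.g. {'big': 2, 'data': 2, 'is': 1, 'analysis': 1}

The high-level languages cited above (Pig Latin, Hive, and so on) generate chains of such Map and Reduce stages from declarative descriptions, so that the programmer does not write these two functions by hand for every operation.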

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logical operation graph are automatically mapped to physical resources.

The operation of Dryad is coordinated by a central program called the job manager, which can be executed on a cluster or on a workstation over the network. A job manager consists of two parts: 1) application code used to build the job communication graph, and 2) program library code used to arrange the available resources. All data is transmitted directly between vertexes; therefore, the job manager is only responsible for decision making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output sets, while MapReduce supports only one input and one output set.


DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets with a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements of Set A against all elements of Set B. The comparison result is an output matrix M, also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, which lets the workload of every partition retrieve its input data effectively. In Phase III, after the data has been delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates the node running commands to acquire the data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, generally a single file list in which all results are put in order.
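
Functionally, an All-Pairs job amounts to evaluating F over the Cartesian product of the two sets, as the following toy Python sketch shows; the real system's contribution lies in the four distribution and batching phases described above, not in this core loop. The sets and the comparison function are made up for illustration.

    def all_pairs(set_a, set_b, f):
        # Result matrix M with M[i][j] = F(A[i], B[j]), i.e. the cross join of A and B under F.
        return [[f(a, b) for b in set_b] for a in set_a]

    # Toy biometric-style comparison: size of the overlap between two feature sets.
    overlap = lambda a, b: len(set(a) & set(b))
    M = all_pairs(["acgt", "ttag"], ["gatt", "ccgg"], overlap)
    print(M)   # [[3, 2], [3, 1]]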

– Pregel: The Pregel [104] system of Google facilitates the processing of large graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed as a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge, associated with a source vertex, is constituted by a user-defined value and the identifier of a target vertex. When the graph has been built, the program conducts iterative computations, called supersteps, separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function expressing the logic of a given algorithm. Every vertex may modify its own status and the status of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in the inactive state and no messages are in transit, the entire program execution is complete. The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and the output of a Pregel program are isomorphic directed graphs.
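
The vertex-centric superstep style can be imitated on a single machine; below is an illustrative Python sketch (not Google's API) that propagates the maximum vertex value through a small graph, a common minimal Pregel example. Messages produced in one superstep are delivered in the next, and the program halts when no messages remain in flight.

    def pregel_max(graph, values, max_supersteps=30):
        """graph: vertex -> list of out-neighbours; values: vertex -> initial value."""
        # Superstep 0: every vertex is active and announces its current value.
        messages = {v: [values[v]] for v in graph}
        for step in range(max_supersteps):
            new_messages = {v: [] for v in graph}
            sent = False
            for v in graph:
                if not messages[v]:
                    continue                               # vertex is inactive this superstep
                candidate = max(messages[v])
                # compute(): adopt a larger value and tell the neighbours; otherwise vote to halt.
                if candidate > values[v] or step == 0:
                    values[v] = max(values[v], candidate)
                    for w in graph[v]:
                        new_messages[w].append(values[v])
                        sent = True
            messages = new_messages
            if not sent:                                   # no messages in flight: the job is done
                break
        return values

    g = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(pregel_max(g, {"a": 3, "b": 6, "c": 2}))         # {'a': 6, 'b': 6, 'c': 6}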

Inspired by the above programming models, other research has also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and data-dependent flow control decision making [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional and big data, analytical architectures for big data, and software used for mining and analyzing big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which changes frequently and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of the data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis of a special kind of data, so many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data (a minimal sketch is given after this list).

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, so that a few factors can reveal most of the information in the original data.


– Correlation Analysis: an analytical method for determining the laws of relations among observed phenomena, such as correlation, correlative dependence, and mutual restriction, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) functional relations, reflecting a strict dependence among phenomena, also called definitive dependence relations; (ii) correlations, i.e., undetermined or inexact dependence relations, in which the numerical value of one variable may correspond to several numerical values of another variable, and such values exhibit regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relations among variables that are hidden by randomness. Regression analysis can turn complex and undetermined correlations among variables into something simple and regular.

– A/B Testing: also called bucket testing, a technique for determining how to improve target variables by comparing tested groups. Big data requires a large number of tests to be executed and analyzed.

– Statistical Analysis: based on statistical theory, a branch of applied mathematics in which randomness and uncertainty are modeled with probability theory. Statistical analysis can provide both descriptions and inferences for big data: descriptive statistical analysis summarizes and describes datasets, while inferential statistical analysis draws conclusions from data subject to random variation. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process of extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
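
As a concrete instance of the cluster analysis method listed first above, the following sketch runs k-means on a toy two-dimensional dataset; scikit-learn is assumed to be available, and the data and the choice of two clusters are made up for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy 2-D points forming two visually separated groups
    # (e.g., purchase amount vs. visit frequency per customer).
    X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [8.0, 9.0], [8.3, 9.1], [7.9, 8.7]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)            # cluster id assigned to every object, e.g. [0 0 0 1 1 1]
    print(km.cluster_centers_)   # the two cluster centroids

Objects in the same cluster end up close to a shared centroid (high homogeneity), while the two centroids are far apart (high heterogeneity), which is exactly the grouping behavior described above.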

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to create value for enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: a Bloom filter consists of a series of hash functions. Its principle is to store the hash values of data rather than the data itself, using a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy-compression storage of data. It has advantages such as high space efficiency and high query speed, but also disadvantages such as misrecognition (false positives) and difficulty with deletion (a minimal sketch is given after this list).

– Hashing: a method that transforms data into shorter, fixed-length numerical or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of a hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to utilizing several computing resources simultaneously to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
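
The Bloom filter described in the first item of the list above can be sketched in a few lines of Python. The bit-array size, the number of hash functions, and the double-hashing construction below are illustrative choices rather than a reference implementation.

    import hashlib

    class BloomFilter:
        def __init__(self, size=1024, k=3):
            self.size, self.k = size, k
            self.bits = bytearray(size)            # the bit array (one byte per bit, for simplicity)

        def _positions(self, item: str):
            # Derive k positions from two base hashes (double hashing), a common construction.
            h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
            h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16)
            return [(h1 + i * h2) % self.size for i in range(self.k)]

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p] = 1

        def might_contain(self, item: str) -> bool:
            # False means "definitely absent"; True means "probably present" (false positives possible).
            return all(self.bits[p] for p in self._positions(item))

    bf = BloomFilter()
    bf.add("user:42")
    print(bf.might_contain("user:42"), bf.might_contain("user:99"))   # True, (almost certainly) False

Queries that return False are guaranteed correct, while True answers may occasionally be false positives; this is the misrecognition cost mentioned above, traded for very high space efficiency.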

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages have been developed on top of these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, as well as Scope and DryadLINQ for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

                                 MPI                                MapReduce                          Dryad
Deployment                       Computing nodes and data           Computing and data storage at      Computing and data storage at
                                 storage arranged separately        the same node (computing moved     the same node (computing moved
                                 (data moved to computing nodes)    close to the data)                 close to the data)
Resource management/scheduling   -                                  Workqueue (Google), HOD (Yahoo)    Not clear
Low-level programming            MPI API                            MapReduce API                      Dryad API
High-level programming           -                                  Pig, Hive, Jaql, ...               Scope, DryadLINQ
Data storage                     Local file system, NFS, ...        GFS (Google), HDFS (Hadoop),       NTFS, KFS, Cosmos DFS
                                                                    Amazon S3, ...
Task partitioning                Users manually partition tasks     Automatic                          Automatic
Communication                    Messaging, remote memory access    Files (local FS, DFS)              Files, TCP pipes,
                                                                                                       shared-memory FIFOs
Fault tolerance                  Checkpointing                      Task re-execution                  Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since the data constantly changes, rapid data analysis is needed, and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis is generally conducted by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, with hot data residing in memory, so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis, and MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. Currently, mainstream BI products provide data analysis plans that support data above the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the kind of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for the analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software tools, according to a 2012 KDnuggets survey of 798 professionals, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, or Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. R is actually a realization of the S language, an interpreted language developed by AT&T Bell Labs for data exploration, statistical analysis, and plotting. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are included, but such plug-ins can be used only after users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was used more frequently than R (which ranked first). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes built from various operators. The entire flow can be seen as a production line in a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly add new nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, web data analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems, prevailing in the 1990s, were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have provided organizations with a unique opportunity to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably larger capacity for supporting location-aware, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and to building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operable analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observational data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. This classification aims to emphasize data characteristics, but some of the fields may utilize similar underlying technologies. Since data analysis has a broad scope and it is not easy to cover it comprehensively, we focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercial technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business-oriented potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
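
Many of the text mining tasks listed above, classification and opinion mining in particular, start from a bag-of-words or TF-IDF representation. The following sketch, which assumes scikit-learn is available and uses made-up two-class training snippets, shows a minimal text classification pipeline in Python.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical labelled snippets (e.g., opinion mining on product comments).
    docs = ["great phone, love the camera", "battery died after a week",
            "excellent screen and fast", "terrible support, broken on arrival"]
    labels = ["positive", "negative", "positive", "negative"]

    # TF-IDF turns unstructured text into feature vectors; Naive Bayes learns the classes.
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["the camera is great but support is terrible"]))

Real text mining pipelines add the NLP steps mentioned above (tokenization quality, part-of-speech tagging, disambiguation) before or instead of this simple vectorization, but the overall flow of representation followed by learning is the same.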

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the parts of the Web to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML pages that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based search.

Web structure mining involves models for discovering the link structure of the Web. Here, the structure refers to the schematic graph of links within a website or among multiple websites. Models are built on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing these models [127].
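
The link-structure models mentioned above can be made concrete with the standard PageRank power iteration. The toy link graph, the damping factor, and the iteration count in the following Python sketch are illustrative only.

    def pagerank(links, damping=0.85, iters=50):
        """links: page -> list of pages it links to (a tiny Web link structure)."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for p, outs in links.items():
                if not outs:                           # dangling page: spread its rank evenly
                    for q in pages:
                        new_rank[q] += damping * rank[p] / len(pages)
                else:
                    for q in outs:                     # pass rank along each outgoing hyperlink
                        new_rank[q] += damping * rank[p] / len(outs)
            rank = new_rank
        return rank

    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print({p: round(r, 3) for p, r in pagerank(web).items()})

Pages that receive links from many highly ranked pages end up with higher scores, which is the intuition behind using link topology to rank relevant pages.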

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs from Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data has increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by exploiting the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; multimedia analysis extracts useful knowledge from such data and helps to understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization aims to present the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and the semantic level. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users in conveniently and quickly looking up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, all of which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and the retrieval result is optimized with relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of the items they are interested in, and recommend other contents with similar features. These methods rely largely on content similarity measurement, but most of them suffer from limited analysis capability and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents for group members according to their behavior [132]. Presently, hybrid methods that integrate the advantages of both types have been introduced to improve recommendation quality [133].

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and example videos [134]. In [135], the authors proposed a new algorithm for multimedia event detection using only a few positive training examples. Research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since an SNS is a dynamic network, new vertexes and edges are continually added to the graph. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for vertexes and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in an SNS [140]. Linear algebra approaches compute the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between sub-graphs have much lower density [142].
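
As a minimal instance of similarity-based link prediction, the following Python sketch scores non-adjacent vertex pairs by their number of common neighbours; the toy undirected graph is made up and far simpler than a real SNS.

    from itertools import combinations

    def common_neighbour_scores(adj):
        """adj: vertex -> set of neighbouring vertexes in an undirected social graph."""
        scores = {}
        for u, v in combinations(adj, 2):
            if v in adj[u]:
                continue                               # already linked, nothing to predict
            scores[(u, v)] = len(adj[u] & adj[v])      # more shared friends -> more likely future link
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    graph = {"ann": {"bob", "cat"},
             "bob": {"ann", "cat", "dan"},
             "cat": {"ann", "bob", "dan"},
             "dan": {"bob", "cat"}}
    print(common_neighbour_scores(graph))   # [(('ann', 'dan'), 2)]

Feature-based classifiers generalize this idea by combining several such structural scores (common neighbours, path counts, and so on) as input features of a trained binary classifier.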

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on an objective function that captures the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and derive models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144-146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertising, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in an SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning data, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android platform had provided more than 650,000 apps, covering nearly all categories. By the end of 2012, monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since research on mobile analysis has just begun, we only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical location and communities based on different cultural backgrounds and interests (e.g., the recent WeChat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. By contrast, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) gather together on a network, meet to pursue a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism for raw data for real-time health monitoring. For the circumstance where only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize them.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones that analyzes a person's gait while walking and uses the gait information to unlock a security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from, and is mainly used in, enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data by virtue of big data technology, and no manual intervention occurs in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

The IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills that year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals, based on an information network. Big data of online SNS mainly comes from instant messages, online social interactions, micro blogs, and shared spaces, etc., and represents various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, by virtue of theories and methods from mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

–   Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

–   Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5  Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
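As a rough illustration of the Indonesian rice-price finding, the sketch below computes the correlation between a weekly series of rice-price tweet counts and the official food-price inflation series. The numbers are made up for the example; only the technique (a simple Pearson correlation on aligned weekly series) reflects the kind of relationship the project reportedly observed.

```python
import numpy as np

def pearson_correlation(tweet_counts, inflation_rates):
    """Pearson correlation between weekly tweet counts mentioning rice
    price and the official food-price inflation for the same weeks."""
    x = np.asarray(tweet_counts, dtype=float)
    y = np.asarray(inflation_rates, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical aligned weekly series (illustrative values only).
weekly_tweets = [120, 150, 180, 260, 400, 520, 610, 580]
weekly_inflation = [4.1, 4.3, 4.6, 5.2, 6.0, 6.8, 7.1, 7.0]
print(round(pearson_correlation(weekly_tweets, weekly_inflation), 2))
```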

Generally speaking, the application of big data from online SNS may help us better understand users' behavior and master the laws of social and economic activities from the following three aspects:

–   Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services (a minimal detection sketch follows this list).

–   Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

–   Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.
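The early-warning idea in the list above reduces, in its simplest form, to flagging a sudden jump or drop in a monitored count (e.g., daily messages about a topic) relative to its recent history. Below is a minimal sketch using a rolling z-score; the window size and threshold are arbitrary assumptions for illustration, not values from the survey.

```python
import statistics

def detect_spike(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count deviates from the mean of the
    previous `window` days by more than `threshold` standard deviations."""
    alerts = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mean = statistics.mean(history)
        std = statistics.pstdev(history) or 1.0   # avoid division by zero
        if abs(daily_counts[i] - mean) / std > threshold:
            alerts.append(i)
    return alerts

# A flat series with one abnormal surge on day 10.
counts = [100, 98, 103, 99, 101, 97, 102, 100, 99, 101, 450, 104]
print(detect_spike(counts))  # [10]
```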

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment, in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. At present, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through its software development kit (SDK) and open interface.

Fig. 6  The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows: a user requests services and resources related to a specified location; then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will be more prevalent than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
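The assignment step of that workflow is essentially matching each location-tagged task to nearby willing workers. The sketch below shows one simple way to do it, assuming worker and task positions are given as latitude/longitude pairs and that a fixed travel radius is acceptable; real spatial crowdsourcing platforms use far more elaborate assignment policies.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def assign_task(task_location, workers, max_km=2.0):
    """Pick the closest willing worker within `max_km` of the task, if any.

    workers: dict mapping worker_id -> (lat, lon).
    Returns (worker_id, distance_km) or None.
    """
    candidates = [
        (haversine_km(*task_location, *pos), wid) for wid, pos in workers.items()
    ]
    candidates = [(d, wid) for d, wid in candidates if d <= max_km]
    if not candidates:
        return None
    d, wid = min(candidates)
    return wid, round(d, 2)

# Example: a photo-collection task with two nearby workers (coordinates invented).
task = (30.5928, 114.3055)
workers = {"w1": (30.60, 114.31), "w2": (30.55, 114.25)}
print(assign_task(task, workers))  # ('w1', ...)
```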

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

–   Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city: preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

–   Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has deployed smart electric meters with great success, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal pricing sketch follows this list).

–   The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
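As a toy illustration of the time-sharing dynamic pricing mentioned in the second item above, the sketch below maps 15-minute smart-meter timestamps to a peak or off-peak tariff and totals a customer's bill. The peak window and the two tariff values are invented for the example and are not TXU Energy's actual rates.

```python
from datetime import datetime

PEAK_HOURS = range(17, 21)             # assumed peak window: 17:00-20:59
PEAK_RATE, OFFPEAK_RATE = 0.30, 0.12   # illustrative $/kWh tariffs

def price_reading(timestamp, kwh):
    """Price a single 15-minute consumption reading."""
    rate = PEAK_RATE if timestamp.hour in PEAK_HOURS else OFFPEAK_RATE
    return kwh * rate

def bill(readings):
    """Total cost for a list of (timestamp, kwh) 15-minute readings."""
    return round(sum(price_reading(ts, kwh) for ts, kwh in readings), 2)

# Two readings: one off-peak, one during the assumed evening peak.
readings = [
    (datetime(2014, 3, 1, 9, 15), 0.4),   # off-peak: 0.4 kWh * 0.12
    (datetime(2014, 3, 1, 18, 30), 0.6),  # peak:     0.6 kWh * 0.30
]
print(bill(readings))  # 0.23
```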

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

–   Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research, because big data is not formally and structurally defined and the existing models have not been strictly verified.

–   Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after a system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions before and after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

–   Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

–   Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more values.


–   Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

–   Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data (a minimal depreciation sketch follows this list).

–   Processing of big data: As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust, i.e., the incorrect data collected during acquisition: in big data, not only the correct data but also the wrong data should be utilized to generate more value.
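One simple way to make the "rate of depreciation of data" mentioned in the real-time performance item concrete is to weight each record by an exponentially decaying factor of its age, so that stale observations contribute less to an online estimate. The half-life below is an arbitrary assumption for illustration; the survey itself does not prescribe a specific depreciation model.

```python
import math

def depreciated_value(initial_value, age_hours, half_life_hours=24.0):
    """Exponentially decay a record's value: after one half-life the
    record counts half as much as a fresh one."""
    return initial_value * math.exp(-math.log(2) * age_hours / half_life_hours)

def time_weighted_mean(observations, half_life_hours=24.0):
    """Weighted mean of (value, age_hours) pairs, newer data weighing more."""
    weights = [depreciated_value(1.0, age, half_life_hours) for _, age in observations]
    total = sum(w * v for w, (v, _) in zip(weights, observations))
    return total / sum(weights)

# A fresh reading of 10 dominates a two-day-old reading of 30.
print(round(time_weighted_mean([(10, 0.0), (30, 48.0)]), 2))  # 14.0
```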

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

–   Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

–   Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

–   Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

–   Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown to be not applicable to big data. In particular, big data safety is confronted with the following security-related challenges.

–   Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, acquired data from the public pages of Facebook users who had failed to modify their privacy settings, via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

–   Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources with poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

–   Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

–   Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's way of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

–   Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data; the theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

–   Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

–   Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

–   Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

–   Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

–   Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

–   During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

–   Compared with accurate data, we would like to accept numerous and complicated data.

–   We shall pay greater attention to correlations between things rather than exploring causal relationships.

–   The simple algorithms of big data are more effective than complex algorithms of small data.

–   Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, the analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments  This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12
2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf
3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper
4. Drowning in numbers – digital data will flood the planet – and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/big-data-0
5. Lohr S (2012) The age of big data. New York Times, pp 11
6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data
7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data
8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html
9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data
10. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute
11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt
12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note 6, February
13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media
14. Meijer E (2011) The world according to LINQ. Commun ACM 54(10):45–51
15. Beyer M (2011) Gartner says solving 'big data' challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp
16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media
17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/
18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014
19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present and future. UCI ISG lecture series on scalable data management
21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43
22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery
24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81
25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27
26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033
27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98
28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States
29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM
30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy
31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Trans Parallel Distrib Syst 23(10):1831–1843
32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Pract Experience 23(17):2338–2354
33. Gantz J, Reinsel D (2010) The digital universe decade – are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16
34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33
35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008
36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68
37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180
38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729
39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems. ACM, pp 103–116
40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 international conference on, IPSN'08. IEEE, pp 332–343
41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems. ACM, pp 309–322
42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007 6th international symposium on, IPSN 2007. IEEE, pp 254–263
43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks. IEEE Computer Society, pp 277–288
44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63
45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687
46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135
47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160
48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84
49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142
50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009 35th European conference on, ECOC'09. IEEE, pp 1–4
51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synth Lect Comput Archit 4(1):1–108
52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204
53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577
54. Cisco data center interconnect design and deployment guide (2010)
55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM Computer Communication Review, vol 39. ACM, pp 51–62
56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74
57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350
58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62
59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338
61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24
62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8
63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383
64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395
65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454
66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246
67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101
68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209
69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113
70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62
71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871
72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008 IEEE 24th international conference on, ICDE 2008. IEEE, pp 1480–1482
73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82
74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Trans Multimed 14(3):669–682
75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278
76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Trans Comput Biol Bioinforma (TCBB) 9(5):1387–1398
77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Trans Comput Biol Bioinforma 8(2):428–440
78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1021–1032
79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1
80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7
81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59
82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10
83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276
84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8
85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220
86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663
87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4
88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation. USENIX Association, pp 335–350
89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5
90. George L (2011) HBase: the definitive guide. O'Reilly Media, Inc
91. Judd D (2008) hypertable-0.9.0.4-alpha
92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc
93. Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)
94. Murty J (2009) Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc
95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc
96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986
97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322
98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298
99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proc VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629
101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72
102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14
103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008 IEEE international symposium on, IPDPS 2008. IEEE, pp 1–11
104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146
105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296
106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818
107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2
108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7
109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9
110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York
111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html
113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer
114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT
115. Beyond the PC. Special report on personal technology (2011)
116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034
117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77
118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286
119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Record 34(2):18–26
120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM SIGMOD Record 33(1):50–57
121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inform Syst (TMIS) 3(2):7
122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press
123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Trans Neural Netw 13(5):1163–1177
124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11
125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117
126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the World-Wide Web. In: VLDB, vol 95, pp 54–65
127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640
128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2
129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25
130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19
131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819
132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939
133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311
134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91
135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478
136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569
137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company
138. Aggarwal CC (2011) An introduction to social network data analytics. Springer
139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054
140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10
142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web. ACM, pp 631–640
143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25
144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on internet measurement conference. ACM, pp 315–321
145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on internet measurement conference. ACM, pp 145–158
146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on internet measurement conference. ACM, pp 131–144
147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016
148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816
149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666
150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360
151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)
152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51
153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2
154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478
155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454
156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


Page 4: Big Data: A Survey Min Chen

Fig. 2 The 4Vs feature of big data

The McKinsey report is regarded as prospective and predictive, while the following facts may validate the values of big data. During the 2009 flu pandemic, Google obtained timely information by analyzing big data, which even provided more valuable information than that provided by disease prevention centers. Nearly all countries required hospitals to inform agencies such as disease prevention centers of new cases of the new type of influenza. However, patients usually did not see doctors immediately when they got infected. It also took some time to send information from hospitals to disease prevention centers, and for disease prevention centers to analyze and summarize such information. Therefore, by the time the public became aware of the pandemic of the new type of influenza, the disease might have already spread for one to two weeks, with a hysteretic nature. Google found that, during the spreading of influenza, the entries frequently sought at its search engines would be different from those at ordinary times, and the use frequencies of the entries were correlated to the influenza spreading in both time and location. Google found 45 search entry groups that were closely relevant to the outbreak of influenza and incorporated them in specific mathematical models to forecast the spreading of influenza, and even to predict the places where influenza spread from. The related research results have been published in Nature [18].
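The core of this approach can be illustrated with a toy sketch: weekly frequencies of flu-related queries are correlated with reported case counts, and a simple fitted model is then used to estimate current activity. The numbers below are invented for illustration only; Google's actual query groups and models are far more elaborate and are not reproduced here.

```python
import numpy as np

# Hypothetical weekly data: relative search volume of flu-related queries
# and reported influenza cases for the same weeks.
query_freq = np.array([0.8, 1.1, 1.9, 3.2, 4.0, 3.1])
cases = np.array([120, 160, 300, 520, 640, 500])

r = np.corrcoef(query_freq, cases)[0, 1]             # correlation in time
slope, intercept = np.polyfit(query_freq, cases, 1)  # least-squares linear fit

this_week = 2.5                                      # current query frequency
print(f"correlation r = {r:.2f}")
print(f"estimated cases this week: {slope * this_week + intercept:.0f}")
```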

In 2008, Microsoft purchased Farecast, a sci-tech venture company in the US. Farecast has an airline ticket forecast system that predicts the trends and rising/dropping ranges of airline ticket prices. The system has been incorporated into the Bing search engine of Microsoft. By 2012, the system had saved nearly USD 50 per ticket per passenger, with a forecast accuracy as high as 75 %.

At present, data has become an important production factor that could be comparable to material assets and human capital. As multimedia, social media, and IoT are developing, enterprises will collect more information, leading to an exponential growth of data volume. Big data will have a huge and increasing potential in creating values for businesses and consumers.

1.4 The development of big data

In the late 1970s, the concept of "database machine" emerged, which is a technology specially used for storing and analyzing data. With the increase of data volume, the storage and processing capacity of a single mainframe computer system became inadequate. In the 1980s, people proposed "share nothing", a parallel database system, to meet the demand of the increasing data volume [19]. The share nothing system architecture is based on the use of clusters, where every machine has its own processor, storage, and disk. The Teradata system was the first successful commercial parallel database system. Such databases became very popular later. On June 2, 1986, a milestone event occurred, when Teradata delivered the first parallel database system with a storage capacity of 1 TB to Kmart to help the large-scale retail company in North America expand its data warehouse [20]. In the late 1990s, the advantages of parallel databases were widely recognized in the database field.

However, many challenges on big data arose. With the development of Internet services, indexes and queried contents were rapidly growing. Therefore, search engine companies had to face the challenges of handling such big data. Google created the GFS [21] and MapReduce [22] programming models to cope with the challenges brought about by data management and analysis at the Internet scale. In addition, contents generated by users, sensors, and other ubiquitous data sources also fueled the overwhelming data flows, which required a fundamental change of the computing architecture and large-scale data processing mechanism. In January 2007, Jim Gray, a pioneer of database software, called such transformation "The Fourth Paradigm" [23]. He also thought the only way to cope with such a paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred; EMC/IDC published a research report titled Extracting Values from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest in both industry and academia on big data.

Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, etc., have started their big data projects. Taking IBM as an example, since 2005, IBM has invested USD 16 billion on 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of "data processing" in big data. In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. In the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis into 48 emerging technologies that deserve most attention.

Many national governments, such as the US, also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the "Big Data Research and Development Plan", which was a second major scientific and technological development initiative after the "Information Highway" initiative in 1993. In July 2012, the "Vigorous ICT Japan" project issued by Japan's Ministry of Internal Affairs and Communications indicated that the big data development should be a national strategy and application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

1.5 Challenges of big data

The sharply increasing data deluge in the big data era brings about huge challenges on data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs only apply to structured data, rather than semi-structured or unstructured data. In addition, RDBMSs are increasingly utilizing more and more expensive hardware. It is apparent that the traditional RDBMSs cannot handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the requirements on infrastructure for big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For solutions of permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Such programming frameworks have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy big data analysis systems.

Some literature [26–28] discusses obstacles in the development of big data applications. The key challenges are listed as follows:

– Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

– Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective ways to reduce the indirect cost of the entire system, on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant, which may be filtered and compressed at orders of magnitude.

– Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with a lot of pressing challenges, one of which is that the current storage system cannot support such massive data. Generally speaking, values hidden in big data depend on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

– Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed with a lack of scalability and expandability, which could not meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and started to become mainstream in big data analysis. Even so, there are still some problems of non-relational databases in their performance and particular applications. We shall find a compromising solution between RDBMSs and non-relational databases. For example, some enterprises have utilized a mixed database architecture that integrates the advantages of both types of databases (e.g., Facebook and Taobao). More research is needed on the in-memory database and sample data based on approximate analysis.

– Data confidentiality: most big data service providers or owners at present could not effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, the transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data, to ensure its safety.

– Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economy and environment perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data, while the expandability and accessibility are ensured.

– Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

– Cooperation: analysis of big data is an interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.

2 Related technologies

In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data center, and Hadoop.

2.1 Relationship between cloud computing and big data

Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

Even though there are many overlapping technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of databases and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands and by cloud computing, which developed from virtualization technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data; the two supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured feature, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT is hinged upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data has already lagged behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT


2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern "data" rather than "center". A data center has masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but, at present, it is the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to the data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of the data center is increasingly expanding, it is also an important issue how to reduce the operational cost for the development of data centers.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only concern itself with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop in 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering, etc. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.
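To make the underlying programming model concrete, the sketch below shows a word-count job written in the Hadoop Streaming style, in which the mapper and the reducer are plain programs that read stdin and emit tab-separated key-value pairs, and the framework sorts the mapper output by key before the reducer sees it. This is a minimal illustration under that assumption, not the configuration of any deployment cited above; the shell pipeline in the comment merely emulates the shuffle with sort.

```python
#!/usr/bin/env python
# Word count in the Hadoop Streaming style.
# Local emulation of the job:
#   cat input.txt | python wc.py map | sort | python wc.py reduce
import sys
from itertools import groupby

def mapper():
    # Emit one "word<TAB>1" line per token.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # Input arrives sorted by key, so all counts for a word are adjacent.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```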

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of the genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine to data-driven science. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the subsequent application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that the loose coupling will be increasingly applied to research on electron cloud, and the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition constitute an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, huge amounts of data in terms of search entries, Internet forum posts, chatting records, and microblog messages are generated. Those data are closely related to people's daily life, and have similar features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improve the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the per-day business turnover through the Internet, from enterprises to enterprises and from enterprises to consumers, will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its target advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close transmission may rely on sensor networks and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are distributedly deployed, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data, but also the historical data within a certain time frame, should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location, and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic (a filtering sketch is given after this list).
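A toy sketch of this filtering idea follows; the is_abnormal predicate and the sample records are hypothetical stand-ins for, e.g., a violation detector running over traffic video frames, and are not taken from the survey.

```python
# Keep only the small fraction of abnormal samples produced by an IoT source.
def is_abnormal(sample):
    # Hypothetical detector: flag frames tagged with a traffic violation or accident.
    return sample.get("event") in {"violation", "accident"}

frames = [
    {"ts": 1, "event": None},
    {"ts": 2, "event": "violation"},
    {"ts": 3, "event": None},
    {"ts": 4, "event": "accident"},
]
valuable = [f for f in frames if is_abnormal(f)]
print(f"kept {len(valuable)} of {len(frames)} frames")
```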

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed in the beginning of the 21st century, the frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, information of about 13 million people has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet". IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167 TB to 665 TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank has more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, has recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004, the data volume generated per night will surpass 20 TB. The last application is related to high-energy physics. In the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2 PB/s and stores about 10 TB of processed data per year.

In addition, pervasive sensing and computing across natural, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
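As a rough illustration of why such pre-processing pays off, the sketch below compresses a stream of highly redundant environmental-sensor readings with a general-purpose compressor; the readings are invented for the example, and real deployments would typically use domain-specific filtering or encoding, but the effect of redundancy is the same.

```python
import json
import zlib

# Invented readings from a slowly changing temperature sensor: highly redundant.
readings = [{"sensor": "t-01", "seq": i, "temp_c": 21.5} for i in range(1000)]

raw = json.dumps(readings).encode()
packed = zlib.compress(raw, 9)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes, "
      f"ratio {len(raw) / len(packed):.0f}:1")
```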

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: as one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly include the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format (a parsing sketch for the NCSA format is given at the end of this subsection). Databases, rather than text files, may sometimes be used to store log information, to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled at different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: at present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. A web crawler acquires a URL in the order of precedence through a URL queue, and then downloads web pages, identifies all URLs in the downloaded web pages, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology and zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.

– Zero-copy packet capture technology: the so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: at present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.

In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
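As a concrete example of the log-file method mentioned at the beginning of this subsection, the sketch below parses a single access-log line in the NCSA common log format with a regular expression; the sample line is fabricated, and field names are chosen for readability.

```python
import re

# Fields of the NCSA common log format: host, ident, user, [time], "request", status, size.
CLF = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '192.0.2.1 - alice [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
match = CLF.match(line)
if match:
    print(match.group("host"), match.group("status"), match.group("request"))
```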

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architectures, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, backbone networks have been deployed with WDM optical transmission systems with single channel rates of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TOR), and then such top-of-rack switches are connected with 10 Gbps aggregation switches in the topological structure (an illustrative sizing sketch follows this list). The three-layer topological structure is a structure augmented with one layer on top of the two-layer topological structure, and such a layer is constituted by 10 Gbps or 100 Gbps core switches to connect aggregation switches in the topological structure. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
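To give a feel for the numbers in the two-layer structure described above, the sketch below computes the oversubscription at a TOR switch under assumed parameters; the server count per rack and the single 10 Gbps uplink are illustrative assumptions, not measurements from any particular data center.

```python
# Hypothetical sizing for the two-layer rack/aggregation topology described above.
servers_per_rack = 40      # assumption
server_nic_gbps = 1        # 1 Gbps links from servers to the TOR switch
tor_uplink_gbps = 10       # assumed single 10 Gbps link from the TOR switch upward

offered_gbps = servers_per_rack * server_nic_gbps   # traffic the rack can offer
oversubscription = offered_gbps / tor_uplink_gbps   # ratio seen at the TOR uplink

print(f"offered load per rack: {offered_gbps} Gbps")
print(f"TOR uplink oversubscription: {oversubscription:.0f}:1")
```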

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have serious requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: the data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting source systems, and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. On the contrary, it includes information or metadata related to actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistency, and it is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, which includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data, so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, leading to data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4-based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

On generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed as redundant and will be replaced by the corresponding stored data block (a minimal sketch is given at the end of this subsection). Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
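A minimal sketch of the repeated-data-deletion (deduplication) mechanism described above follows: each block is identified by a hash, and a block whose identifier already appears on the identification list is stored only once. The chunk contents and the choice of SHA-256 are illustrative assumptions, not prescriptions from the survey.

```python
import hashlib

store = {}   # identifier -> stored block; the keys play the role of the identification list

def put(block: bytes) -> str:
    ident = hashlib.sha256(block).hexdigest()
    if ident not in store:       # new content: keep it
        store[ident] = block
    return ident                 # a duplicate is replaced by a reference to the stored block

writes = [b"chunk-A", b"chunk-B", b"chunk-A", b"chunk-A"]
refs = [put(b) for b in writes]
print(f"{len(writes)} blocks written, {len(store)} blocks actually stored")
```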

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data accessing. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as the auxiliary equipment of a server, a data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue big storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.

184 Mobile Netw Appl (2014) 19171ndash209

4.1 Storage system for massive data

Various storage systems emerge to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

In terms of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) disc arrays, the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connections between one or more disc arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks for multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected and can still satisfy customers' requests for reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable if the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system cannot simultaneously meet the requirements on consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, but only provide eventual consistency rather than the strong consistency of the previous two types of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data error is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers have their own solutions to meet the different demands for big data storage. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy copy, simple APIs, eventual consistency, and support of large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services of the Amazon e-Commerce Platform, which can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data replication, and object versioning mechanisms. The Dynamo partition scheme relies on Consistent Hashing [86], whose main advantage is that the addition or removal of a node only affects its directly adjacent nodes and does not affect the other nodes, dividing the load among multiple main storage machines. Dynamo replicates data to N sets of servers, where N is a configurable parameter, in order to achieve


high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations, reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. Notably, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup. A minimal sketch of the consistent-hashing partition scheme underlying several of these systems is given below.
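The following is a simplified illustration of consistent hashing as used for key partitioning (cf. Dynamo [86]); the ring construction with virtual nodes is a common textbook variant, and the hash function and replica counts are illustrative assumptions rather than any specific system's implementation.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # Map an arbitrary string onto a position on the hash ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Simplified consistent hashing with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=100):
        # Adding (or removing) a node only remaps the keys that fall in the
        # ring segments adjacent to its virtual positions.
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def get_node(self, key: str):
        pos = _hash(key)
        idx = bisect.bisect(self._ring, (pos, chr(0x10FFFF)))
        if idx == len(self._ring):           # wrap around the ring
            idx = 0
        return self._ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get_node("user:42"))              # the key is routed to one server
```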

– Column-oriented databases: column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented over multiple nodes to achieve expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed structured data storage system designed to process large-scale (PB-class) data distributed among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The indexes of the map are the row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and are continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. Columns are grouped according to the prefixes of their keys, thus forming column families, which are the basic units of access control. Timestamps are 64-bit integers used to distinguish different versions of a cell value. Clients may flexibly determine the number of cell versions to be stored. These versions are ordered by decreasing timestamp, so the latest version is always read first.

The BigTable API features the creation and deletion of Tablets and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one active Master server at a time, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, as well as deleted or disabled files used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of the loaded Tablets. When Tablets become too big, they are split by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally;


it provides a persistent, ordered, and immutable mapping from keys to values, where both keys and values are arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; 6) storing the access control lists.

– Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts ideas and concepts from both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column family, column, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family at runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly flushes them into files on disks. Row operations are atomic, with row-level locking and transaction processing, which is optional for large-scale operations. Partition and distribution are operated transparently and leave space for client hashing or fixed keys.

HyperTable was developed similarly to BigTable to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those of BigTable. HyperTable has its own query language, called the Hypertable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple versions, while HBase and HyperTable focus on strong consistency through locks or log records. A minimal sketch of the multi-dimensional sorted-map abstraction shared by these systems is given below.
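The following sketch illustrates the (row key, column family:qualifier, timestamp) to value abstraction described above; it is a toy in-memory model for illustration only, not the storage format or API of BigTable, HBase, or Cassandra.

```python
from collections import defaultdict
import time

class SortedMapTable:
    """Toy model of a BigTable-style multi-dimensional sorted map."""

    def __init__(self):
        # row key -> column ("family:qualifier") -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        versions = self._rows[row][column]
        versions.append((ts, value))
        versions.sort(reverse=True)          # newest version first

    def get(self, row, column):
        versions = self._rows[row].get(column)
        return versions[0][1] if versions else None   # latest version wins

    def scan(self, start_row, end_row):
        # Rows are kept in lexicographic order, so range scans are cheap.
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, {c: v[0][1] for c, v in self._rows[row].items()}

t = SortedMapTable()
t.put("com.example/index", "anchor:home", "Example")
t.put("com.example/index", "anchor:home", "Example v2")
print(t.get("com.example/index", "anchor:home"))       # latest version
print(list(t.scan("com.example", "com.examplf")))       # range scan by row key
```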

– Document databases: compared with key-value stores, document stores can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries over all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is executed with log files on the master node that record all the high-level operations conducted in the database. During replication, the slaves query the master for all the write operations since their last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database provided as a web service by Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot scale with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB assures eventual consistency but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharing mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records. A minimal sketch of the document-store model common to these systems follows.
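As an illustration of the document model described above (schema-free JSON-like documents with unique identifiers and field-based queries), the following toy store is a sketch only; the query syntax loosely mimics JSON-style predicates and is not the API of MongoDB, SimpleDB, or CouchDB.

```python
import uuid

class DocumentStore:
    """Toy schema-free document store with simple field-equality queries."""

    def __init__(self):
        self._docs = {}                            # _id -> document (a dict)

    def insert(self, doc: dict) -> str:
        doc = dict(doc)
        doc.setdefault("_id", str(uuid.uuid4()))   # unique document identifier
        self._docs[doc["_id"]] = doc
        return doc["_id"]

    def find(self, query: dict):
        # Return every document whose fields match all query predicates.
        for doc in self._docs.values():
            if all(doc.get(field) == value for field, value in query.items()):
                yield doc

store = DocumentStore()
store.insert({"type": "article", "title": "Big Data: A Survey", "year": 2014})
store.insert({"type": "article", "title": "MapReduce", "year": 2004})
print([d["title"] for d in store.find({"year": 2014})])
```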

Big data are generally stored in hundreds or even thousands of commercial servers. Thus, traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, several proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two

functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications: the user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft. A minimal word-count example in the Map/Reduce style is sketched below.
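To make the two-function model concrete, here is a minimal word-count sketch with a toy sequential driver standing in for the framework's shuffle phase; it illustrates the programming model only and is not Hadoop or Google MapReduce code.

```python
from collections import defaultdict

# User-defined Map function: input (key, value) -> intermediate (key, value) pairs.
def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1

# User-defined Reduce function: (key, list of values) -> output values.
def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(inputs):
    # The framework normally shuffles intermediate pairs across nodes;
    # here a dictionary groups values by key on a single machine.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    results = {}
    for ikey, ivalues in groups.items():
        for okey, ovalue in reduce_fn(ikey, ivalues):
            results[okey] = ovalue
    return results

docs = [("d1", "big data survey"), ("d2", "big data storage")]
print(run_mapreduce(docs))   # {'big': 2, 'data': 2, 'survey': 1, 'storage': 1}
```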

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via the data channels, including documents, TCP connections, and shared-memory FIFOs. During execution, the resources in the logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed on a cluster or a workstation reachable through the network. A job manager consists of two parts: 1) application code, which is used to build the job communication graph, and 2) program library code, which is used to arrange the available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any number of inputs and outputs, while MapReduce supports only one input and output set.


DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A with all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, which lets the workload of every partition retrieve its input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire the data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, generally a single file list in which all results are put in order. A toy sketch of the all-pairs comparison itself is given below.
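The core computation, comparing every element of Set A with every element of Set B under a user-supplied function F, is sketched below in its trivial sequential form; the distributed scheduling described above is omitted, and the similarity function is only an illustrative assumption.

```python
def all_pairs(set_a, set_b, f):
    # Output matrix M with M[i][j] = F(A[i], B[j]), the cross join of A and B.
    return [[f(a, b) for b in set_b] for a in set_a]

# Illustrative comparison function: overlap between two feature sets.
def overlap(a, b):
    return len(set(a) & set(b))

A = [{"x", "y"}, {"y", "z"}]
B = [{"x"}, {"y", "z"}, {"w"}]
M = all_pairs(A, B, overlap)
print(M)   # [[1, 1, 0], [0, 2, 0]]
```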

– Pregel: The Pregel [104] system of Google facilitates the processing of large graphs, e.g., the analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge, associated with its source vertex, is constituted by a user-defined value and the identifier of the target vertex. After the graph is built, the program conducts iterative computations, called supersteps, separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express the given algorithm logic. Every vertex may modify its own status and the status of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in an inactive state and no messages are in transit, the entire program execution is completed.

The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs. A minimal vertex-centric sketch in the spirit of this model follows.
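The following sketch runs a vertex-centric superstep loop that propagates the maximum vertex value through a small graph; it is a single-machine illustration of the model (per-vertex compute step, message passing, halting when no messages remain), not the Pregel API.

```python
def pregel_max(graph, values):
    """Vertex-centric sketch: propagate the maximum value to every vertex."""
    # graph: vertex -> list of neighbor vertexes; values: vertex -> initial value.
    # Initially, every vertex announces its value to its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for u in neighbors:
            inbox[u].append(values[v])

    while any(inbox.values()):                 # each iteration is one superstep
        outbox = {v: [] for v in graph}
        for v in graph:
            if not inbox[v]:
                continue                       # no messages: the vertex stays halted
            # "Compute" step: adopt the largest value received so far.
            new_value = max([values[v]] + inbox[v])
            if new_value != values[v]:
                values[v] = new_value
                for u in graph[v]:             # message passing along out-edges
                    outbox[u].append(new_value)
        inbox = outbox                         # global synchronization point
    return values

g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(pregel_max(g, {"a": 3, "b": 6, "c": 1}))   # {'a': 6, 'b': 6, 'c': 6}
```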

Inspired by the above programming models, other research has also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data (see the sketch after this list).

– Factor analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, so that a few factors can reveal most of the information in the original data.



– Correlation analysis: an analytical method for determining the laws of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, also called a definitive dependence relationship; (ii) correlation, an undetermined or inexact dependence relation, where the numerical value of one variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis can turn complex and undetermined correlations among variables into something simple and regular.

– A/B testing: also called bucket testing, a technique for determining how to improve target variables by comparing tested groups. Big data requires a large number of tests to be executed and analyzed.

– Statistical analysis: statistical analysis is based on statistical theory, a branch of applied mathematics, in which randomness and uncertainty are modeled with probability theory. Statistical analysis can provide descriptions and inferences for big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research.
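As a small illustration of cluster analysis (referenced in the list above), the following sketch applies k-means, one of the ICDM top-ten algorithms, using scikit-learn; the library choice, the synthetic data, and k = 2 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D dataset: two loose groups of points (illustrative only).
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 7.7], [8.3, 8.1]])

# Unsupervised grouping into k = 2 clusters; no training labels are needed.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)            # cluster assignment of each point
print(km.cluster_centers_)   # centroid of each cluster
```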

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom filter: a Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of the data rather than the data itself, using a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has advantages such as high space efficiency and high query speed, but also has disadvantages: it may produce false positives, and it does not support deletion (a sketch is given after this list).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical values or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically as data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons among character strings to the greatest extent, so as to improve query efficiency.

– Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
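The Bloom filter mentioned in the list above can be sketched in a few lines; the use of double hashing over MD5/SHA-1 digests and the parameter values are illustrative assumptions, not recommended settings.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions set k positions in a bit array."""

    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits

    def _positions(self, item: str):
        # Derive k positions via double hashing of two digests (an assumption).
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "probably present"
        # (false positives are possible, and deletions are not supported).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("big data")
print(bf.might_contain("big data"))    # True
print(bf.might_contain("small data"))  # almost certainly False
```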

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages have been developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, as well as Scope and DryadLINQ for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce and Dryad

– Deployment: MPI: computing nodes and data storage arranged separately (data must be moved to the computing nodes); MapReduce: computing and data storage arranged at the same node (computing should be close to data); Dryad: computing and data storage arranged at the same node (computing should be close to data).
– Resource management / scheduling: MPI: –; MapReduce: Workqueue (Google), HOD (Yahoo); Dryad: not clear.
– Low-level programming: MPI: MPI API; MapReduce: MapReduce API; Dryad: Dryad API.
– High-level programming: MPI: –; MapReduce: Pig, Hive, Jaql, etc.; Dryad: Scope, DryadLINQ.
– Data storage: MPI: the local file system, NFS, etc.; MapReduce: GFS (Google), HDFS (Hadoop), KFS, Amazon S3, etc.; Dryad: NTFS, Cosmos DFS.
– Task partitioning: MPI: users manually partition the tasks; MapReduce: automatic; Dryad: automatic.
– Communication: MPI: messaging, remote memory access; MapReduce: files (local FS, DFS); Dryad: files, TCP pipes, shared-memory FIFOs.
– Fault tolerance: MPI: checkpoint; MapReduce: task re-execution; Dryad: task re-execution.

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory, so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis: for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. Currently, mainstream BI products provide data analysis plans that support data scales over the TB level.

– Massive analysis: for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly according to the kind of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software tools, according to a survey of 798 professionals on "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. R is actually a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey on "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are also installed, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (which ranked first in 2012). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes consisting of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source and feature-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME; developers can effortlessly add new nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated into Pentaho and can be called directly.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of commercial applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have provided a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity for supporting location-aware, people-oriented, and context-aware operations.

– Evolution of network applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and to building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of scientific applications: scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, e.g., in astrophysics, oceanology, genomics, and environmental research. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar underlying technologies. Since data analysis has a broad scope and it is not easy to provide comprehensive coverage, we will focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-relevant potential than structured data. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based search.

Web structure mining involves models for discovering link structures on the Web. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127]. A minimal PageRank-style sketch is given below.
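As an illustration of link-based ranking in the spirit of PageRank [125], the following power-iteration sketch uses a damping factor of 0.85 and a fixed iteration count; these parameter choices and the toy graph are illustrative assumptions, not the original algorithm's exact formulation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, links in graph.items():
            if not links:                      # dangling page: spread rank evenly
                for other in graph:
                    new_rank[other] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(sorted(pagerank(web).items(), key=lambda kv: -kv[1]))  # "c" ranks highest
```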

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data has increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; it is analyzed to extract useful knowledge and understand the semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation jointly [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the results of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories, so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and the retrieval result is refined by relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of the contents they are interested in, and recommend other contents with similar features. These methods largely rely on content similarity measurement, but most of them suffer from limited analysis capability and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents for group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the two aforementioned types of methods to improve recommendation quality [133].

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and example videos [134]. In [135], the authors proposed a new algorithm for multimedia event detection using only a few positive training examples. Research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has long focused on link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra approaches compute the similarity between two vertexes from a rank-reduced similarity matrix, e.g., obtained via singular value decomposition [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
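A minimal example of neighborhood-based link prediction (one simple feature-based score, not the specific methods of [139–141]) using NetworkX's Jaccard coefficient on a toy graph:

```python
import networkx as nx

# Toy social graph: vertices are users, edges are existing relationships
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])

# Score every non-existing edge by the Jaccard coefficient of the two neighborhoods;
# higher scores suggest a higher chance of a future link.
candidates = sorted(nx.jaccard_coefficient(G), key=lambda t: t[2], reverse=True)
for u, v, score in candidates[:3]:
    print(f"{u}-{v}: {score:.2f}")
```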

Many methods for community detection have been proposed and studied, most of which are topology-based, relying on target functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
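For illustration only (not the overlapping-community method of [143]), a modularity-based community detection sketch on a standard toy graph using NetworkX:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # classic small social network used as a toy example

# Greedy modularity maximization partitions the vertices into densely connected groups
communities = greedy_modularity_communities(G)
for i, members in enumerate(communities):
    print(f"community {i}: {sorted(members)}")
```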

Social influence refers to the case where individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, as does Twitter with trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile data analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile data analysis has been started in different fields. Since research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, entertainment, etc.) gather together on networks, meet to make a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjøvik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes people's paces when they walk and uses the pace information for unlocking a safety system [11]. In the meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in accelerometer of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
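A drop-out (churn) warning model of this kind is essentially a binary classifier over customer activity features. The sketch below is purely illustrative (made-up features and data, not CMB's actual model), using scikit-learn logistic regression to rank customers by churn risk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up features per customer: [monthly transactions, average balance (k RMB), months since last login]
X = np.array([[25, 80, 0], [2, 5, 6], [18, 40, 1], [1, 3, 9], [30, 120, 0], [3, 8, 5]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = customer dropped out in the past

model = LogisticRegression().fit(X, y)

# Score current customers and flag the highest-risk ones for retention offers
current = np.array([[4, 10, 4], [22, 60, 0], [2, 6, 7]], dtype=float)
risk = model.predict_proba(current)[:, 1]
print(sorted(zip(risk, ["c1", "c2", "c3"]), reverse=True))
```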

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and to optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for the county. For instance, the Department of Park Management of Miami-Dade County saved one million USD in water bills in one year, due to timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods from mathematics, informatics, sociology, and management science, etc., along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented by applying big data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, then get married when they are about 30 years old; finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
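Aspect 1) above amounts to simple anomaly detection on a time series of daily topic counts. A minimal sketch (synthetic counts and thresholds, not Global Pulse's actual pipeline) flags days whose count deviates from a trailing mean by more than a few standard deviations:

```python
import numpy as np

def detect_spikes(counts, window=7, z_thresh=3.0):
    """Flag indices where the daily topic count deviates strongly from the trailing window."""
    counts = np.asarray(counts, dtype=float)
    flags = []
    for i in range(window, len(counts)):
        past = counts[i - window:i]
        z = abs(counts[i] - past.mean()) / (past.std() + 1e-9)
        if z > z_thresh:
            flags.append(i)
    return flags

# Synthetic daily counts of tweets mentioning a topic, with a sudden surge on the last day
daily = [120, 130, 125, 118, 140, 135, 128, 132, 127, 610]
print(detect_spikes(daily))  # -> [9]
```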

Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing, complex data containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final results into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglycerides in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photography, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user may request services and resources related to a specified location. Then, mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecasted that spatial crowdsourcing will be more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
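The matching step of this framework, i.e., deciding which nearby participant should handle a location-bound task, can be sketched as a nearest-worker assignment using great-circle (haversine) distance; the coordinates and the greedy one-task policy below are illustrative assumptions only.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def assign_task(task_location, workers):
    """Assign a spatial task to the closest willing worker (greedy, one task at a time)."""
    return min(workers, key=lambda w: haversine_km(*task_location, *w["loc"]))

workers = [{"id": "w1", "loc": (30.52, 114.31)},
           {"id": "w2", "loc": (30.60, 114.27)}]
task = (30.55, 114.30)  # location where a photo or video is requested
print(assign_task(task, workers)["id"])
```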

6.3.6 Smart grid

Smart Grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has successfully deployed smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to the peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal billing sketch of this idea is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
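The time-sharing (time-of-use) dynamic pricing mentioned in the second bullet can be sketched as a bill computed from smart-meter readings with different peak and off-peak rates; the rates, peak hours, and readings below are invented for illustration and are not TXU Energy's actual tariff.

```python
def time_of_use_bill(readings, peak_hours=range(17, 21), peak_rate=0.30, offpeak_rate=0.12):
    """Compute a bill from (hour, kWh) smart-meter readings using peak/off-peak rates."""
    total = 0.0
    for hour, kwh in readings:
        rate = peak_rate if hour in peak_hours else offpeak_rate
        total += kwh * rate
    return round(total, 2)

# One day of illustrative readings aggregated to (hour, kWh) pairs
readings = [(7, 0.8), (12, 1.1), (18, 2.5), (19, 2.2), (23, 0.6)]
print(time_of_use_bill(readings))  # consumption at 18-19h is billed at the higher peak rate
```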

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. Firstly, we introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many big data solutions claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after the system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: These include the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust, which refers to the wrong or discarded data produced during acquisition; in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by the globally-distributed database Spanner of Google and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of sciences: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers a revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data, rather than only analyzing a small set of sample data.

– Compared with accurate data, we would be willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things, rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4. Drowning in numbers – digital data will flood the planet – and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/big-data-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15. Beyer M (2011) Gartner says solving 'big data' challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O'Reilly Radar Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90. George L (2011) HBase: the definitive guide. O'Reilly Media, Inc.

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc.

93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc.

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc.

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629
101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72
102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14
103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: IEEE international symposium on parallel and distributed processing (IPDPS 2008). IEEE, pp 1–11
104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146
105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296
106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818
107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2
108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7
109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9
110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York
111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html
113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer
114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT
115. Beyond the PC. Special report on personal technology (2011)
116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034
117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77
118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286
119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Record 34(2):18–26
120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM SIGMOD Record 33(1):50–57
121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inform Syst (TMIS) 3(2):7
122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press
123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Trans Neural Netw 13(5):1163–1177
124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11
125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117
126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65
127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640
128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2
129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25
130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19
131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819
132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939
133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311
134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91
135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478
136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569
137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company
138. Aggarwal CC (2011) An introduction to social network data analytics. Springer
139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054
140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10
142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on world wide web. ACM, pp 631–640
143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25
144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321
145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158
146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144
147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016
148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816
149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666
150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360
151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)
152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51
153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2
154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478
155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454
156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


called such transformation "The Fourth Paradigm" [23]. He also thought the only way to cope with such a paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred: EMC/IDC published a research report titled Extracting Values from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest in big data from both industry and academia.

Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, etc., have started their big data projects. Taking IBM as an example, since 2005, IBM has invested USD 16 billion on 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of "data processing" in big data. In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. In the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis into the 48 emerging technologies that deserve most attention.

Many national governments, such as the US, also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the "Big Data Research and Development Plan", which was a second major scientific and technological development initiative after the "Information Highway" initiative in 1993. In July 2012, the "Vigorous ICT Japan" project issued by Japan's Ministry of Internal Affairs and Communications indicated that big data development should be a national strategy and application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

1.5 Challenges of big data

The sharply increasing data deluge in the big data era brings about huge challenges on data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs only apply to structured data, rather than semi-structured or unstructured data. In addition, RDBMSs are increasingly utilizing more and more expensive hardware. It is apparent that traditional RDBMSs cannot handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the requirements on infrastructure for big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Such programming frameworks (e.g., MapReduce) have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy big data analysis systems.

Some literature [26–28] discusses obstacles in the development of big data applications. The key challenges are listed as follows:

– Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

– Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective ways to reduce the indirect cost of the entire system, on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant, and may be filtered and compressed at orders of magnitude.

– Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with a lot of pressing challenges, one of which is that the current storage system cannot support such massive data. Generally speaking, values hidden in big data depend on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

– Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed with a lack of scalability and expandability, which cannot meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and have started to become mainstream in big data analysis. Even so, there are still some problems of non-relational databases in their performance and particular applications. We shall find a compromising solution between RDBMSs and non-relational databases. For example, some enterprises have utilized a mixed database architecture that integrates the advantages of both types of databases (e.g., Facebook and Taobao). More research is needed on the in-memory database and sample data based on approximate analysis.

– Data confidentiality: most big data service providers or owners at present cannot effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, the transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data, to ensure its safety.

– Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economy and environment perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data while the expandability and accessibility are ensured.

– Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

– Cooperation: analysis of big data is an interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.

2 Related technologies

In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data center, and Hadoop.

2.1 Relationship between cloud computing and big data

Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

Even though there are many overlapped technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates at the upper level supported by cloud computing and provides functions similar to those of a database and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands, while cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data; both of them supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous amount of networking sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured feature, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT is hinged upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT

2.3 Data center

In the big data paradigm, the data center not only is a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern "data" rather than "center": a data center has masses of data, and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resource. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but, at present, it is the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and provide effective data backup. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to the data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of the data center is increasingly expanding, it is also an important issue how to reduce the operational cost for the development of data centers.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only concern itself with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering, etc. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of the genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine to data-driven science. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the subsequent application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that the loose coupling will be increasingly applied to research on electron cloud, and the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.
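To make the MapReduce programming model referenced throughout this section concrete, the following is a minimal sketch of a word-count job in plain Python. It only simulates the map, shuffle, and reduce phases inside one process; it is not Hadoop API code, and the two-document corpus is an invented example.

    from collections import defaultdict

    def map_phase(doc_id, text):
        # Emit (word, 1) pairs, as a word-count mapper would.
        for word in text.lower().split():
            yield word, 1

    def shuffle(pairs):
        # Group intermediate values by key, as the framework does between phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(word, counts):
        # Sum the occurrence counts for each word.
        return word, sum(counts)

    if __name__ == "__main__":
        corpus = {1: "big data needs new tools", 2: "hadoop processes big data"}
        intermediate = [kv for doc_id, text in corpus.items()
                        for kv in map_phase(doc_id, text)]
        results = dict(reduce_phase(w, c) for w, c in shuffle(intermediate))
        print(results)  # e.g. {'big': 2, 'data': 2, ...}

In a real Hadoop deployment the map and reduce functions run on many nodes and the shuffle is performed by the framework over the network; the sketch only shows the division of labor among the three phases.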

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of searching entries, Internet forum posts, chatting records, and microblog messages is generated. Those data are closely related to people's daily life and have the similar features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improve the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close transmission may rely on sensor networks and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are distributedly deployed, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data are also different in type, and such data feature heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed in the beginning of the 21st century, the frontier research in the bio-medicine field also enters the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but also the leading roles can be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R & D, and grain production (e.g., transgenic crops).

The completion of HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R & D also rises quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet". IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167 TB to 665 TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the US National Center for Biotechnology Information. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope was improved, by 2004 the data volume generated per night surpassed 20 TB. The last application is related to high-energy physics. In the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2 PB/s and stores about 10 TB of processed data per year.

In addition, pervasive sensing and computing, across nature, commercial, Internet, government, and social environments, are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: as one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information, to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., the video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habit monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: at present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded web page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped (a minimal crawler sketch is given right after this item). Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.
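The queue-driven crawling loop described above can be illustrated with a short Python sketch. This is only an illustrative toy under stated assumptions: the seed URL, the page limit, and the regular-expression link extraction are chosen for the example, and production crawlers additionally need robots.txt handling, politeness delays, and robust HTML parsing.

    import re
    import urllib.request
    from collections import deque

    def crawl(seed_url, max_pages=10):
        # URL frontier (queue) and the set of URLs already fetched.
        frontier = deque([seed_url])
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue  # skip unreachable or non-text pages
            visited.add(url)
            # Naive link extraction; a real crawler would use an HTML parser.
            for link in re.findall(r'href="(https?://[^"#]+)"', html):
                if link not in visited:
                    frontier.append(link)
        return visited

    if __name__ == "__main__":
        print(crawl("https://example.com", max_pages=3))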

The current network data acquisition technologies mainly include the traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used. (A small capture sketch is given after this list.)


– Zero-copy packet capture technology: the so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: at present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy". It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in the similar manner.
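As a companion to the Libpcap discussion above, the following Python sketch captures a few raw frames with an ordinary AF_PACKET socket and prints their Ethernet headers. It is a simplified stand-in rather than Libpcap or zero-copy code: it is Linux-only, requires root privileges, copies every packet into user space, and capturing all protocols via ETH_P_ALL (0x0003) is an assumption made for the example.

    import socket
    import struct

    def capture(count=5):
        # A raw AF_PACKET socket sees every link-layer frame (Linux, root required).
        sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
        for _ in range(count):
            frame, _addr = sock.recvfrom(65535)
            # First 14 bytes of an Ethernet frame: destination MAC, source MAC, EtherType.
            dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
            print("src=%s dst=%s type=0x%04x len=%d" %
                  (src.hex(":"), dst.hex(":"), ethertype, len(frame)))
        sock.close()

    if __name__ == "__main__":
        capture()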

In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as the IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top rack switches (TOR), and then such top rack switches are connected with 10 Gbps aggregation switches in the topological structure. The three-layer topological structure is a structure augmented with one layer on top of the two-layer topological structure, and such a layer is constituted by 10 Gbps or 100 Gbps core switches to connect the aggregation switches in the topological structure. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using the low-cost multi-mode fiber (MMF) with 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
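A quick back-of-the-envelope calculation illustrates why the layered rack/aggregation design described above matters for intra-DCN bandwidth. The figures below (40 servers per rack with 1 Gbps access links, four 10 Gbps uplinks per top-of-rack switch) are assumed purely for illustration and are not taken from the survey.

    # Oversubscription of an assumed rack in a two-layer data center network.
    servers_per_rack = 40        # assumed
    server_link_gbps = 1         # 1 Gbps access links, as in the two-layer example
    uplinks_per_tor  = 4         # assumed
    uplink_gbps      = 10        # 10 Gbps links toward aggregation switches

    ingress = servers_per_rack * server_link_gbps   # 40 Gbps offered by servers
    egress  = uplinks_per_tor * uplink_gbps         # 40 Gbps toward aggregation
    ratio   = ingress / egress                      # 1.0 means non-blocking at this layer
    print("oversubscription ratio = %.1f:1" % ratio)

With fewer or slower uplinks the ratio rises above 1:1, which is exactly the bandwidth shortfall that motivates the optical interconnection proposals cited above.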

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have serious requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some data pre-processing techniques relating to these issues are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: the data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems, and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. On the contrary, it includes information or metadata related to actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistency, and is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, which includes a lot of abnormal data, limited by the physical design and affected by environmental noises. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, leading to data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well-known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion is a special data compression technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such an operation plays an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
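To make the repeated-data-deletion idea concrete, the following minimal sketch (an illustration only, not any particular product's implementation) splits a byte stream into fixed-size blocks, identifies each block by a SHA-256 digest, and stores only the first copy of each distinct block:

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 4096):
    """Fixed-size block deduplication: return (recipe, unique_blocks).

    recipe lists block identifiers in stream order; unique_blocks maps
    each identifier to the stored block, so repeated blocks are kept once.
    """
    recipe, unique_blocks = [], {}
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        block_id = hashlib.sha256(block).hexdigest()
        recipe.append(block_id)
        unique_blocks.setdefault(block_id, block)   # store first copy only
    return recipe, unique_blocks

def restore(recipe, unique_blocks) -> bytes:
    """Rebuild the original stream from the identifier list."""
    return b"".join(unique_blocks[block_id] for block_id in recipe)

if __name__ == "__main__":
    stream = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # repeated content
    recipe, store = deduplicate(stream)
    print(len(recipe), "blocks referenced,", len(store), "blocks stored")
    assert restore(recipe, store) == stream
```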

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data accessing. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as the auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue big capacity of storage to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems emerge to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers on a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually an auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

In terms of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disc arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates in multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected in terms of satisfying customers' requests for reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance could not be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they could not handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems could not handle network failures, they could not be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP could not ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for the scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that AP systems also ensure availability. AP systems only ensure eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to the scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Service (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
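One common way such systems position themselves between the CP and AP extremes is quorum replication: with N replicas, a write acknowledged by W of them and a read that consults R of them always overlap on at least one up-to-date copy whenever R + W > N. The sketch below only illustrates this rule; it is not the actual Dynamo or Cassandra code.

```python
def quorum_properties(n: int, r: int, w: int) -> dict:
    """Classify an (N, R, W) replica configuration.

    R + W > N  -> every read quorum intersects every write quorum, so a
                  read sees the latest acknowledged write (CP-leaning).
    R + W <= N -> quorums may miss each other, so only eventual
                  consistency is guaranteed (AP-leaning).
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return {
        "read_write_quorums_overlap": r + w > n,
        "tolerated_failures_for_reads": n - r,
        "tolerated_failures_for_writes": n - w,
    }

if __name__ == "__main__":
    print(quorum_properties(n=3, r=2, w=2))   # overlapping quorums
    print(quorum_properties(n=3, r=1, w=1))   # eventual consistency only
```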

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault-tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store the large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy copy, simple API, eventual consistency, and support of large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may input queried values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized with high expandability and shorter query response time than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through the data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on Consistent Hashing [86] (see the sketch following this overview of NoSQL databases), which has the main advantage that node passing only affects directly adjacent nodes and does not affect other nodes, to divide the load among multiple main storage machines. Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the updating and any other operation, the updating operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo DB, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid the need for backup.

– Column-oriented databases: the column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented in multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimension sequenced mapping with sparse, distributed, and persistent storage. Indexes of the mapping are row key, column key, and timestamps, and every value in the mapping is an unanalyzed byte array. Each row key in BigTable is a 64 KB character string. Rows are stored in lexicographical order and are continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short row of data can be highly effective, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers to distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in the descending order of timestamps, so the latest edition will always be read.

The BigTable API features the creation and deletion of Tablets and column families, as well as modification of metadata of clusters, tables, and column families. Client applications may insert or delete values of BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable includes three main components: Master server, Tablet server, and client library. BigTable only allows one Master server to be distributed, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balance. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, as well as deleted or disabled files, for use in specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of loaded Tablets. When Tablets are too big, they will be segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides mapping between persistent, sequenced, and unchangeable keys and values as arbitrary byte strings. BigTable utilizes Chubby for the following server tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of distributed four-dimensional structured mapping, where the four dimensions include row, column, column family, and super column. A row is distinguished by a string key with arbitrary length. No matter how many columns are to be read or written, the operation on a row is atomic. Columns may constitute clusters, called column families, which are similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable cloned version programmed with Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. The row operations are atomic operations, equipped with row-level locking and transaction processing, which is optional for large scale. Partition and distribution are transparently operated and have space for client hash or fixed key.

HyperTable was developed, similar to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and the distributed lock manager. Data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), and allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.

– Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict modes, there is no need to conduct mode migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files in the main nodes that support all the high-level operations conducted in the database. During copying, the slave nodes query all the writing operations since the last synchronization with the master and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes by automatically balancing load and failover.


– SimpleDB: SimpleDB is a distributed database and is a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partition and thus could not be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein could not be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through the RESTful HTTP API. If a document needs to be modified, the client must download the entire document to modify it, and then send it back to the database. After a document is rewritten once, the identifier will be updated. CouchDB utilizes optimal copying to obtain scalability without a sharding mechanism. Since various CouchDBs may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the copying mechanism. CouchDB supports MVCC with historical Hash records.
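As mentioned for Dynamo above, several of these stores partition data with consistent hashing. The following minimal sketch illustrates the general technique (it is an illustration, not Dynamo's production code): servers and keys are hashed onto the same ring, each key is owned by the first server clockwise from it, and adding or removing a server only moves keys in its immediate neighbourhood.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing ring with virtual nodes."""

    def __init__(self, nodes=(), replicas: int = 100):
        self.replicas = replicas          # virtual nodes per server
        self._ring = []                   # sorted hash positions
        self._owner = {}                  # hash position -> server name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            bisect.insort(self._ring, pos)
            self._owner[pos] = node

    def remove(self, node: str) -> None:
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            self._ring.remove(pos)
            del self._owner[pos]

    def lookup(self, key: str) -> str:
        """Return the server owning key (first ring position clockwise)."""
        pos = self._hash(key)
        idx = bisect.bisect(self._ring, pos) % len(self._ring)
        return self._owner[self._ring[idx]]

if __name__ == "__main__":
    ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
    before = {k: ring.lookup(k) for k in ("user:1", "user:2", "user:3")}
    ring.add("server-d")                  # only nearby keys should move
    after = {k: ring.lookup(k) for k in before}
    print(before, after, sep="\n")
```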

Big data is generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce will combine all the intermediate values related to the same key and transmit them to the Reduce function, which further compresses the value set into a smaller set (a minimal word-count sketch in this style is given after this list of programming models). MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via the data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or workstations through the network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making, which does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which allows the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: the Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable and user-defined value, and every directed edge related to a source vertex is constituted by the user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its status and the status of its output edges, receive a message sent from the previous superstep, send the message to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Functions of every vertex may be removed by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.

Inspired by the above programming models, other researchers have also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
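As a concrete illustration of the two-function model described above, the following sketch emulates MapReduce word counting in plain Python. It is a single-process illustration of the programming model only, not the distributed Hadoop or Google runtime.

```python
from collections import defaultdict
from itertools import chain

def map_fn(document: str):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_fn(word, counts):
    """Reduce: compress the value list for one key into a single count."""
    return word, sum(counts)

def mapreduce(documents):
    intermediate = chain.from_iterable(map_fn(d) for d in documents)
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate))

if __name__ == "__main__":
    docs = ["big data storage", "big data analysis", "data mining"]
    print(mapreduce(docs))   # e.g. {'big': 2, 'data': 3, ...}
```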

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category will have high homogeneity while different categories will have high heterogeneity. Cluster analysis is an unsupervised study method without training data.

– Factor analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor, and the few factors are then used to reveal most of the information of the original data.


– Correlation analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, some undetermined or inexact dependence relations, where the numerical value of a variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation surrounding their mean values.

– Regression analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research (a minimal k-means sketch follows this list).
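As a small illustration of the cluster analysis and k-means entries above, the sketch below implements a plain Lloyd's-algorithm iteration of k-means on two-dimensional points (a textbook illustration, not the exact formulation evaluated in [111]):

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means on a list of (x, y) points."""
    random.seed(seed)
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:      # converged
            break
        centers = new_centers
    return centers, clusters

if __name__ == "__main__":
    data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
    centers, clusters = kmeans(data, k=2)
    print(centers)
```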

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are shown as follows.

– Bloom Filter: a Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store the Hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition and deletion (a minimal sketch is given after this list).

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound Hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost for storing index files, which should be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparison on character strings to the greatest extent, so as to improve query efficiency.

– Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem and assign the pieces to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
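A minimal Bloom filter sketch follows, illustrating the bit-array and hash-function mechanics described in the first item of this list (the array size and hash choices here are illustrative, not tuned): membership tests may report false positives but never false negatives, and elements cannot be deleted.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False -> definitely absent; True -> present or a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

if __name__ == "__main__":
    bf = BloomFilter()
    for word in ("hadoop", "gfs", "bigtable"):
        bf.add(word)
    print(bf.might_contain("gfs"), bf.might_contain("dryad"))
```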

Although the parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment:
  MPI: computing node and data storage arranged separately (data should be moved to the computing node)
  MapReduce: computing and data storage arranged at the same node (computing should be close to data)
  Dryad: computing and data storage arranged at the same node (computing should be close to data)

Resource management / scheduling:
  MPI: –
  MapReduce: Workqueue (Google), HOD (Yahoo)
  Dryad: not clear

Low level programming:
  MPI: MPI API
  MapReduce: MapReduce API
  Dryad: Dryad API

High level programming:
  MPI: –
  MapReduce: Pig, Hive, Jaql, ...
  Dryad: Scope, DryadLINQ

Data storage:
  MPI: the local file system, NFS, ...
  MapReduce: GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ...
  Dryad: NTFS, Cosmos DFS

Task partitioning:
  MPI: user manually partitions the tasks
  MapReduce: automatic
  Dryad: automatic

Communication:
  MPI: messaging, remote memory access
  MapReduce: files (local FS, DFS)
  Dryad: files, TCP pipes, shared-memory FIFOs

Fault tolerance:
  MPI: checkpoint
  MapReduce: task re-execution
  Dryad: task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, while even the TB level is common. Therefore, an internal database technology may be used, and hot data shall reside in the memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSD (Solid-State Drive), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Currently, mainstream BI products are provided with data analysis plans supporting a level over TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" of 798 professionals made by KDNuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks top 1 in the KDNuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphic user interface (GUI). RapidMiner is written in Java. It integrates the learner and evaluation method of Weka and works with R. Functions of RapidMiner are implemented by connecting processes comprising various operators. The entire flow can be deemed as a production line of a factory, with original data input and model results output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to expand KNIME. Developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of commercial applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have online displays and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity of supporting location sensing, people-oriented, and context-aware operation.

– Evolution of network applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, a plentiful supply of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, provide users with great opportunities to create, upload, and share contents.

– Evolution of scientific applications: scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The U.S. National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high varieties in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning, based on exact mathematical models and powerful algorithms, has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].
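As a deliberately simple illustration of statistical anomaly detection of the kind mentioned above (a toy baseline, not the method of the cited work [117]), the sketch below flags values that deviate from the sample mean by more than a chosen number of standard deviations:

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold: float = 2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

if __name__ == "__main__":
    readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2]   # one spike
    print(zscore_anomalies(readings))   # flags the 42.0 reading
```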

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
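One basic building block behind several of the text-mining tasks listed above is TF-IDF term weighting; a minimal sketch follows (a simplified illustration only; production systems use far richer NLP pipelines such as those cited above):

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

if __name__ == "__main__":
    docs = ["big data storage systems",
            "big data analysis methods",
            "text mining extracts knowledge from text"]
    for w in tf_idf(docs):
        print(sorted(w.items(), key=lambda kv: -kv[1])[:3])
```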

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to support more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful application of these models [127].
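
The following is a minimal, self-contained sketch of the PageRank iteration on a toy link graph; the damping factor and convergence tolerance are conventional illustrative choices rather than values from [125], and dangling-page handling is omitted for brevity.

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """Iteratively compute PageRank scores for a dict that maps each page
    to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new_rank = {}
        for p in pages:
            # Rank flowing into p from every page q that links to p
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if links[q] and p in links[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

# Toy Web graph: A links to B; B links to A and C; C links to A
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))
```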

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, while Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data has increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
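
A minimal sketch of the first step of Web usage mining, assuming server access logs in the Apache Common Log Format; the sample lines and field layout are illustrative only.

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [timestamp] "METHOD path PROTO" status bytes
LOG_PATTERN = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

def summarize_access_log(lines):
    """Count requests per client host and per URL path from raw log lines."""
    per_client, per_path = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip malformed lines
        host, _ts, _method, path, _status, _size = match.groups()
        per_client[host] += 1
        per_path[path] += 1
    return per_client, per_path

sample = [
    '10.0.0.1 - - [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2013:13:55:40 -0700] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Oct/2013:13:56:01 -0700] "POST /cart HTTP/1.1" 302 0',
]
clients, paths = summarize_access_log(sample)
print(clients.most_common(1), paths.most_common(1))
```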


6.2.4 Multimedia data analysis

Multimedia data (mainly images, audio, and videos) has been growing at an amazing speed; useful knowledge is extracted and the semantics are understood through analysis. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo!, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
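
As a rough illustration of the static, key-frame-based approach, the sketch below selects a key frame whenever the grayscale histogram changes sharply between frames; it assumes OpenCV (cv2) and NumPy are available, and the threshold and file name are illustrative placeholders rather than choices from [128].

```python
import cv2
import numpy as np

def extract_key_frames(video_path, diff_threshold=0.4):
    """Pick a frame as a key frame whenever its grayscale histogram differs
    enough from the previously selected key frame (a simple static summary)."""
    capture = cv2.VideoCapture(video_path)
    key_frames, last_hist, index = [], None, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256]).flatten()
        hist = hist / (hist.sum() + 1e-9)           # normalize to a distribution
        if last_hist is None or np.abs(hist - last_hist).sum() > diff_threshold:
            key_frames.append((index, frame))        # likely scene change
            last_hist = hist
        index += 1
    capture.release()
    return key_frames

# key_frames = extract_key_frames("lecture.mp4")   # "lecture.mp4" is a placeholder path
```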

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then refined through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It is proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of the contents they are interested in, and recommend other contents with similar features to the users. These methods rely largely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
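
To make the collaborative-filtering idea concrete, here is a minimal sketch over a made-up user-item rating matrix: unseen items are scored for a user through the cosine similarity between item columns. It illustrates the general technique, not any particular system in [132] or [133].

```python
import numpy as np

# Rows = users, columns = items; 0 means "not rated" (toy data for illustration)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_similarity(matrix):
    """Cosine similarity between item columns."""
    norms = np.linalg.norm(matrix, axis=0) + 1e-9
    normalized = matrix / norms
    return normalized.T @ normalized

def recommend(matrix, user, top_n=1):
    """Rank the user's unrated items by a similarity-weighted sum of their ratings."""
    scores = item_similarity(matrix) @ matrix[user]
    candidates = np.where(matrix[user] == 0)[0]          # only unrated items
    ranked = candidates[np.argsort(scores[candidates])[::-1]]
    return ranked[:top_n]

print(recommend(ratings, user=0))   # the unrated item most similar to what user 0 liked
```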

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the authors propose a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities, while the content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has long been committed to link prediction, community discovery, social network evolution, and social influence analysis. An SNS may be visualized as a graph in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNSs are dynamic networks, new vertexes and edges are continually added to the graph. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in an SNS [140]. Linear algebra methods compute the similarity between two vertexes according to the singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
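
A minimal sketch of neighborhood-based link prediction on a toy friendship graph, scoring each non-adjacent pair with the Jaccard coefficient of their neighbor sets; the graph and names are made up for illustration.

```python
from itertools import combinations

def jaccard_link_scores(adjacency):
    """Score each non-adjacent vertex pair by the Jaccard coefficient of
    their neighbor sets; higher scores suggest a more likely future link."""
    scores = {}
    for u, v in combinations(adjacency, 2):
        if v in adjacency[u]:
            continue                      # edge already exists
        common = adjacency[u] & adjacency[v]
        union = adjacency[u] | adjacency[v]
        if union:
            scores[(u, v)] = len(common) / len(union)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy undirected friendship graph given as neighbor sets
graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave":  {"bob"},
}
print(jaccard_link_scores(graph))   # (alice, dave) and (carol, dave) are candidate links
```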

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on target functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in an SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNSs is also known as social media analysis. Social media include text, multimedia, positioning data, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNSs are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNSs contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android application market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., WeChat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. By contrast, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment) gather together on a network, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors propose a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, and physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time health monitoring. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes people's pace when they walk and uses the pace information to unlock the safety system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from, and is mainly used in, enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to close the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is far lower than that of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city cooperation project between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments with the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills due to the timely identification and fixing of running and leaking water pipes.

6.3.3 Application of online social network-oriented big data

An online SNS is a social structure constituted by social individuals and the connections among them, based on an information network. Big data from online SNSs mainly comes from instant messages, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNSs uses computational analytical methods for understanding relations in human society by virtue of theories and methods involving mathematics, informatics, sociology, and management science, from three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Figure 5 illustrates the technical framework of the application of big data from online SNSs. Classic applications of big data from online SNSs are introduced in the following; they mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNSs. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNSs, users are represented as nodes, while social relations, interests, and hobbies aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNSs, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.

Fig. 5 Enabling technologies for online social network-oriented big data

Generally speaking, the application of big data from online SNSs may help us better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final results into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. With such plans, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting that patients reduce the total triglyceride in their bodies if their blood sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications via a software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. Crowd sensing can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as its foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need to intentionally deploy sensing modules or employ professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows: a user requests services and resources related to a specified location; then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
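
The core matching step in such a framework is simply assigning a location-based task to nearby willing workers. Below is a minimal sketch of that step using great-circle distance; the worker names and coordinates are made up for illustration.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def assign_task(task_location, workers, k=2):
    """Pick the k workers closest to the task's location."""
    ranked = sorted(workers.items(),
                    key=lambda kv: haversine_km(*task_location, *kv[1]))
    return [name for name, _ in ranked[:k]]

# Made-up worker positions (lat, lon) and a sensing task near a city centre
workers = {"w1": (30.52, 114.31), "w2": (30.60, 114.40), "w3": (31.23, 121.47)}
print(assign_task((30.50, 114.30), workers))   # -> ['w1', 'w2']
```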

6.3.6 Smart grid

Smart grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-grid-related big data is generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). The smart grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the smart grid, the regions that have excessively high electrical load or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has deployed smart electric meters with great success, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and, because power utilization data (a source of big data) is frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy used such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal billing sketch under such a tariff is given after this list).

– The access of intermittent renewable energy: At present, much new energy, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can then complement the traditional hydropower and thermal power generation.
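
As a toy illustration of the time-sharing (time-of-use) pricing mentioned above, the sketch below bills one day of 15-minute smart-meter readings under a two-level tariff; the readings, peak window, and prices are all made-up values.

```python
def time_of_use_bill(readings_kwh, peak_hours=range(8, 22),
                     peak_price=0.20, offpeak_price=0.08):
    """Bill one day of 15-minute smart-meter readings (96 values) under a
    two-level time-of-use tariff; prices are illustrative, in currency/kWh."""
    total = 0.0
    for slot, kwh in enumerate(readings_kwh):
        hour = (slot * 15) // 60                 # which hour this 15-min slot falls in
        price = peak_price if hour in peak_hours else offpeak_price
        total += kwh * price
    return round(total, 2)

# Made-up day: light load at night, heavier load during the day
day = [0.2] * 32 + [0.5] * 56 + [0.2] * 8        # 96 slots = 24 hours
print(time_of_use_bill(day))                     # -> 6.24
```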

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of the display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there remain many important problems to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many solutions of big data applications claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: as the data scale increases, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust: this refers to erroneous data collected during acquisition; in big data, not only the correct data but also the erroneous data should be utilized to generate value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows rapidly, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of this. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed to be the big data company with the most SNS data at present. According to a report [156], Ron Bowes, a researcher at Skull Security, acquired data from the public pages of Facebook users who had failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have a social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers are concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, scalable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers that promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, the analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2 Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3 Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4 Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5 Lohr S (2012) The age of big data. New York Times, p 1
6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7 Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9 Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10 Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11 Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O'Reilly Radar Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM computer communication review, vol 40, ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems, ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks, ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ, Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management, ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008, IEEE 24th international conference on, ICDE 2008, IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Trans Multimed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Trans Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Trans Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing, ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation, USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing, ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)

94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications, Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proc VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on, IPDPS 2008, IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 135–146

105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing, ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special report on personal technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance, ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers, ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Trans Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the World-Wide Web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval, ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia, ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web, ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis, ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining, ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval, ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium, ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


started to become mainstream in big data analysis. Even so, there are still some problems with non-relational databases in terms of performance and for particular applications. We shall find a compromise solution between RDBMSs and non-relational databases. For example, some enterprises have utilized a mixed database architecture that integrates the advantages of both types of databases (e.g., Facebook and Taobao). More research is needed on the in-memory database and sample data based on approximate analysis.

– Data confidentiality: most big data service providers or owners at present could not effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, the transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data, to ensure its safety.

– Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economic and environmental perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data while expandability and accessibility are ensured.

– Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

– Cooperation: analysis of big data is an interdisciplinary research effort, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.

2 Related technologies

In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data center, and Hadoop.

2.1 Relationship between cloud computing and big data

Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

Even though there are many overlapped technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies will certainly become increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of databases and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands and cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data; the two supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured feature, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT hinges upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT


2.3 Data center

In the big data paradigm, the data center not only is a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers are mainly concerned with "data" rather than "center": a data center has masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but at present it is also the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and maintain effective data backup. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to data centers. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers is increasingly expanding, it is also an important issue how to reduce the operational cost for the development of data centers.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only be concerned with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, its biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine into data-driven science. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and the parallel programming technology (MapReduce) framework may provide the user with an interface with more convenient services and reduce unnecessary costs.
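To make the MapReduce-style processing referred to above more concrete, the following minimal sketch shows a word-count job in the form expected by Hadoop Streaming, where the mapper and reducer are plain scripts reading standard input. The script layout and the invocation hint are illustrative assumptions rather than details taken from the surveyed deployments.

```python
#!/usr/bin/env python
# wordcount.py -- run with argument "map" as the mapper, otherwise as the reducer.
import sys

def run_mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def run_reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so counts for
    # the same word arrive on consecutive lines and can be summed in a pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        run_mapper()
    else:
        run_reducer()
```

A job of this shape would typically be submitted through the Hadoop Streaming jar with its -mapper and -reducer options; the exact command line depends on the cluster installation.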

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Given Internet data as an example, a huge amount of data in terms of searching entries, Internet forum posts, chatting records, and microblog messages is generated. Those data are closely related to people's daily life, and share the features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued "Analysis: the Applications of Big Data to the Real World", which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the per-day business turnover through the Internet, from enterprises to enterprises and from enterprises to consumers, will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its target advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close-range transmission may rely on sensor networks and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are distributedly deployed, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.
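As a toy illustration of the last point, that effective data is only a small share of what IoT devices produce, the sketch below keeps only readings that fall outside an assumed normal band; the record layout and thresholds are invented for the example and are not prescribed by the text.

```python
# Keep only "abnormal" readings from a sensor stream; in many IoT
# deployments these few records carry most of the analytical value.
NORMAL_RANGE = (10.0, 35.0)   # assumed acceptable band, e.g. degrees Celsius

def effective_readings(readings, normal_range=NORMAL_RANGE):
    low, high = normal_range
    for record in readings:              # record: (sensor_id, timestamp, value)
        sensor_id, timestamp, value = record
        if value < low or value > high:  # out-of-band => potentially valuable
            yield record

if __name__ == "__main__":
    stream = [("s1", 1, 21.5), ("s1", 2, 80.2), ("s2", 2, 22.0), ("s2", 3, -3.4)]
    for r in effective_readings(stream):
        print(r)   # only the two abnormal readings are retained
```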

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R & D, and grain production (e.g., transgenic crops).

The completion of HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R & D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet". IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that by 2015 the average data volume of every hospital will increase from 167 TB to 665 TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2014 the data volume generated per night will surpass 20 TB. The last application is related to high-energy physics. At the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generated raw data at 2 PB/s and stores about 10 TB of processed data per year.

In addition, pervasive sensing and computing across nature, commerce, the Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Common data collection methods are shown as follows.

– Log files: as one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management. (A minimal log-parsing sketch is given after this list.)

– Sensing: sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habit monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: at present, network data acquisition is accomplished using a combination of a web crawler, a word segmentation system, a task system, and an index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. A web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped (a minimal crawler sketch is given below). Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.
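The crawling loop just described can be sketched in a few lines. The snippet below is a minimal breadth-first crawler built only on the Python standard library; the seed URL and page limit are arbitrary placeholders, and a real crawler would additionally honor robots.txt, throttle its requests, and persist the pages it fetches.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF_RE = re.compile(r"""href=["'](.*?)["']""", re.IGNORECASE)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch a page, extract its links, enqueue unseen URLs."""
    queue, seen, fetched = deque([seed_url]), {seed_url}, {}
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                       # skip pages that cannot be downloaded
        fetched[url] = html                # a real system would index/persist here
        for link in HREF_RE.findall(html):
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return fetched

# Example (the seed URL is a placeholder): crawl("http://example.com", max_pages=5)
```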
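For the log files described in the first item of this list, a typical first processing step is to split each line into fields. The sketch below parses one line in the NCSA common log format and counts clicks per requested path; the sample line and field rules are illustrative assumptions rather than a description of any particular server.

```python
import re
from collections import Counter

# NCSA common log format: host ident user [time] "request" status bytes
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)')

def count_clicks(lines):
    clicks = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:                              # silently ignore malformed lines
            clicks[m.group("path")] += 1
    return clicks

sample = ['127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326']
print(count_clicks(sample))                # Counter({'/index.html': 1})
```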

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used. (A user-level capture sketch is given after this list.)


– Zero-copy packet capture technology: the so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: at present, mobile devices are widely used. As mobile device functions become increasingly powerful, they feature more complex and multiple means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy": it may collect wireless data and geographical location information and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
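At the user level, Libpcap-style capture is normally reached through a wrapper library rather than written from scratch. The sketch below assumes the third-party Scapy package (a Python front end over libpcap/WinPcap) is installed; it merely prints a one-line summary of a few captured TCP packets, the filter and packet count are arbitrary choices, and capturing usually requires administrator privileges.

```python
# A minimal libpcap-style capture via Scapy (assumed installed: pip install scapy).
from scapy.all import sniff

def show(packet):
    # Called once per captured packet; a real collector would store or forward it.
    print(packet.summary())

if __name__ == "__main__":
    # Capture 10 TCP packets from the default interface (requires root/admin).
    sniff(filter="tcp", prn=show, count=10)
```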

In addition to the aforementioned data acquisition methods for the main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architectures, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52] (the orthogonality condition is written out at the end of this subsection). Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack (ToR) switches, and such switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one more layer on top of the two-layer structure, and this layer is constituted by 10 Gbps or 100 Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches, using the low-cost multi-mode fiber (MMF) with 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
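For completeness, the orthogonality that OFDM exploits in the inter-DCN discussion above can be stated explicitly: when the sub-carrier spacing equals the inverse of the symbol duration $T_s$, any two sub-carriers are uncorrelated over one symbol, which is why their spectra may overlap without mutual interference. The expression below is a standard textbook identity rather than a formula taken from the cited papers:

$$
\frac{1}{T_s}\int_{0}^{T_s} e^{\,j2\pi f_k t}\, e^{-j2\pi f_m t}\,dt
=\begin{cases}
1, & k=m,\\
0, & k\neq m,
\end{cases}
\qquad f_k = f_0 + \frac{k}{T_s}.
$$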

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have serious requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense, but also improves analysis accuracy. Some relevant data pre-processing techniques are discussed as follows; a minimal end-to-end sketch combining them is given at the end of this subsection.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of the data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copying, cleaning, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; instead, it includes information or metadata related to the actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or of search programs and applications: compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance for keeping data consistent, and it is widely applied in many fields, such as banking, insurance, the retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected and may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and by regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects; for example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by exploiting the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

For generalized data transmission or storage, repeated data deletion is a special data compression technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one already in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block (a minimal sketch of this idea is given after this list). Repeated data deletion can greatly reduce storage requirements, which is particularly important for a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of the variety of datasets, it is non-trivial or even impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
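
To make the hash-based repeated-data-deletion idea above concrete, the following minimal Python sketch stores each unique block once and represents repeated blocks by references to the stored copy. It is an illustrative toy under arbitrary assumptions (fixed 4 KB blocks, SHA-256 identifiers), not the design of any particular storage product.

import hashlib

class DedupStore:
    """Toy block-level de-duplication store (illustrative only)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}       # identifier -> stored block
        self.identifiers = []  # identification list for the logical stream

    def write(self, data: bytes):
        # Split the stream into fixed-size blocks, hash each block,
        # and store only blocks whose identifier has not been seen before.
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in self.blocks:      # new block: store it once
                self.blocks[ident] = block
            self.identifiers.append(ident)    # repeated block: keep only the reference

    def read(self) -> bytes:
        # Reassemble the stream from the identification list.
        return b"".join(self.blocks[i] for i in self.identifiers)

store = DedupStore()
payload = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # contains repeated content
store.write(payload)
assert store.read() == payload
print(len(store.blocks), "unique blocks stored for", len(store.identifiers), "logical blocks")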

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for the query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually an auxiliary storage device for a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among the internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network sub-systems, which provide connections among one or more disc arrays and servers; and (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks for multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As more servers are used, the probability of server failures becomes larger. Usually, data is divided into multiple pieces stored at different servers, so as to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates over multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system were not seriously affected and could still satisfy customers' read and write requests. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance for problems caused by such network failures. It would be desirable if the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theorem [80, 81] in 2000, which indicates that a distributed system cannot simultaneously meet the requirements on consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theorem in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, but only guarantee eventual consistency rather than the strong consistency of the previous two types of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Service (SNS) systems, there are many concurrent visits to the data, but a certain amount of data error is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
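
The behavioral difference between CP and AP systems can be illustrated with a toy two-replica store: under a simulated network partition, the CP variant rejects operations it cannot replicate, while the AP variant keeps serving them and reconciles the replicas once the partition heals, yielding eventual consistency. This is a didactic sketch under simplified assumptions (two replicas, a logical clock, last-write-wins reconciliation), not a description of how BigTable, HBase, Dynamo, or Cassandra are actually implemented.

from itertools import count

_clock = count()   # logical timestamp source for the sketch

class Replica:
    def __init__(self):
        self.data = {}   # key -> (timestamp, value)

class CPStore:
    """Rejects writes during a partition: consistent but not always available."""
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def put(self, key, value):
        if self.partitioned:
            raise RuntimeError("partition: write rejected to preserve consistency")
        record = (next(_clock), value)
        for r in self.replicas:               # synchronous replication to all copies
            r.data[key] = record

class APStore:
    """Accepts writes during a partition: available but only eventually consistent."""
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def put(self, key, value, replica=0):
        record = (next(_clock), value)
        if self.partitioned:
            self.replicas[replica].data[key] = record   # write only the reachable replica
        else:
            for r in self.replicas:
                r.data[key] = record

    def heal(self):
        # Reconcile replicas with a last-write-wins rule (eventual consistency).
        self.partitioned = False
        merged = {}
        for r in self.replicas:
            for k, rec in r.data.items():
                if k not in merged or rec[0] > merged[k][0]:
                    merged[k] = rec
        for r in self.replicas:
            r.data = dict(merged)

store = APStore()
store.put("x", 1)
store.partitioned = True
store.put("x", 2, replica=0)   # accepted during the partition
store.heal()                   # replicas converge on the latest value of "x"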

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at the upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers have their own solutions to meet different demands for the storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source code of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges in categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support of large data volumes. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data replication, and object versioning mechanisms. The Dynamo partition plan relies on consistent hashing [86], whose main advantage is that node arrival or departure only affects directly adjacent nodes and does not affect other nodes, so that the load is divided among multiple main storage machines (a minimal sketch of consistent hashing is given after this overview of NoSQL databases). Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation quits. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine; in particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged only a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.

– Column-oriented databases: the column-oriented databases store and process data by columns rather than by rows. Both columns and rows are segmented across multiple nodes to achieve expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional, sorted map with sparse, distributed, and persistent storage. The indexes of the map are the row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and are continually segmented into tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of the machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers used to distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sorted in descending order of timestamps, so the latest edition is always read first.

The BigTable API features the creation and deletion of tablets and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable includes three main components: the Master server, Tablet servers, and the client library. BigTable only allows one Master server to be active, which is responsible for distributing tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage in GFS, i.e., files that have been deleted or disabled by specific BigTable instances. Every Tablet server manages a tablet set and is responsible for the reading and writing of its loaded tablets. When tablets are too big, they are segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine status. The SSTable file format is used to store BigTable data internally, and it provides a mapping from persistent, ordered, immutable keys to values, where both keys and values are arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control table.

– Cassandra: Cassandra is a distributed storage system used to manage huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve eventual consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a cloned version of BigTable programmed in Java and is part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disk. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional at large scale. Partitioning and distribution are operated transparently and have space for client hashing or fixed keys.

Hypertable was developed, similarly to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. Hypertable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those of BigTable. Hypertable has its own query language, called Hypertable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control over multiple editions, while HBase and Hypertable focus on strong consistency through locks or log records.

– Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct mode migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The replication operation in MongoDB is executed with log files on the main nodes that record all the high-level operations conducted in the database. During replication, the slave nodes query all the writing operations since their last synchronization from the master and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB assures eventual consistency but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through the RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on its replication mechanism. CouchDB supports MVCC with historical hash records.
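
As noted in the discussion of Dynamo above, consistent hashing lets a key-value store spread keys over many nodes so that adding or removing a node only affects its neighbors on the hash ring. The following minimal Python sketch shows the basic ring lookup; the MD5 hash and 100 virtual nodes per server are arbitrary assumptions, and replication to N successor nodes is omitted, so this is an illustration of the idea rather than Dynamo's actual implementation.

import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hashing ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                      # sorted list of (hash position, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def lookup(self, key: str) -> str:
        # A key is stored on the first node clockwise from its hash position.
        pos = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[pos][1]

ring = ConsistentHashRing(["server-1", "server-2", "server-3"])
print(ring.lookup("user:42"), ring.lookup("order:7"))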

Big data is generally stored in hundreds or even thousands of commercial servers. Thus, traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap relative to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. In MapReduce, the computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set (a toy single-machine illustration of the Map and Reduce functions is given after this list of programming models). MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, but this has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via the data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed on a cluster or on a workstation accessed through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange the available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad, used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A with all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximate model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve its input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-scale graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge leaving a source vertex is associated with a user-defined value and the identifier of the target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in an inactive status and there are no messages to transmit, the entire program execution is completed. The output of a Pregel program is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
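
To make the Map/Reduce contract described above concrete, the toy word-count example below emulates the two user-supplied functions in plain Python: map emits intermediate key-value pairs, the framework groups the values by key, and reduce compresses each value set. It only mimics the programming model on a single machine; a real MapReduce runtime adds partitioning, scheduling, and fault tolerance, as discussed in the MapReduce item of this list.

from collections import defaultdict
from itertools import chain

# User-supplied functions of the MapReduce model (word count).
def map_fn(_, line):
    for word in line.split():
        yield word, 1                        # emit intermediate key-value pairs

def reduce_fn(word, counts):
    yield word, sum(counts)                  # compress the value set for one key

def run_mapreduce(records, map_fn, reduce_fn):
    # "Shuffle" step: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(k, v) for k, v in records):
        groups[key].append(value)
    # Reduce step, applied independently per key.
    return dict(chain.from_iterable(reduce_fn(k, vs) for k, vs in groups.items()))

lines = enumerate(["big data needs big storage", "big data needs analysis"])
print(run_mapreduce(lines, map_fn, reduce_fn))   # e.g. {'big': 3, 'data': 2, ...}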

Inspired by the above programming models, other research has also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow-control decision-making based on data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which changes frequently and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of the data. Data analysis plays a significant guiding role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data; therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data (a minimal k-means sketch is given after this list of methods).

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, and then using the few factors to reveal most of the information in the original data.



– Correlation Analysis: an analytical method for determining the laws of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, also called a definitive dependence relationship; (ii) correlation, some undetermined or inexact dependence relations, in which the numerical value of one variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into something simple and regular.

– A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description of and an inference about big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
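
As a concrete instance of the cluster analysis mentioned above, the following minimal Python k-means sketch groups two-dimensional points into k clusters by alternating assignment and centroid update. It is a bare-bones illustration (fixed iteration count, Euclidean distance, no empty-cluster handling), not a production clustering tool.

import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means on 2-D points: returns (centroids, cluster labels)."""
    random.seed(seed)
    centroids = random.sample(points, k)          # initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(range(k), key=lambda c: (x - centroids[c][0]) ** 2
                                                    + (y - centroids[c][1]) ** 2)
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels

data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(data, k=2))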

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: a Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of the data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has advantages such as high space efficiency and high query speed, but also has disadvantages in mis-recognition and deletion (a minimal sketch is given after this list of methods).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical values or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically as data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem and assign the pieces to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see the comparison in Table 1).
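
The bit-array mechanism behind the Bloom filter described above, including its one-sided error (false positives but no false negatives), can be shown in a few lines of Python. The sketch below derives k probe positions by double hashing an MD5 digest; the sizes and hash choice are arbitrary assumptions, and the code is illustrative rather than a tuned implementation.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)          # the bit array (one byte per bit, for clarity)

    def _positions(self, item: str):
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]   # double hashing

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("key-1")
print(bf.might_contain("key-1"), bf.might_contain("key-2"))   # True, (almost certainly) False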

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages have been developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

– Deployment: MPI – computing nodes and data storage arranged separately (data should be moved to the computing nodes); MapReduce – computing and data storage arranged at the same node (computing should be close to the data); Dryad – computing and data storage arranged at the same node (computing should be close to the data).
– Resource management/scheduling: MPI – none; MapReduce – Workqueue (Google), HOD (Yahoo); Dryad – not clear.
– Low-level programming: MPI – MPI API; MapReduce – MapReduce API; Dryad – Dryad API.
– High-level programming: MPI – none; MapReduce – Pig, Hive, Jaql, etc.; Dryad – Scope, DryadLINQ.
– Data storage: MPI – the local file system, NFS, etc.; MapReduce – GFS (Google), HDFS (Hadoop), KFS, Amazon S3, etc.; Dryad – NTFS, Cosmos DFS.
– Task partitioning: MPI – the user manually partitions the tasks; MapReduce – automatic; Dryad – automatic.
– Communication: MPI – messaging, remote memory access; MapReduce – files (local FS, DFS); Dryad – files, TCP pipes, shared-memory FIFOs.
– Fault tolerance: MPI – checkpointing; MapReduce – task re-execution; Dryad – task re-execution.

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. Currently mainstream BI products provide data analysis plans that support levels beyond TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the kinds of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software tools according to a survey of 798 professionals, "What Analytics, Data mining, Big Data software you used in the past 12 months for a real project", conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, or Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was used more frequently than R (ranking first that year). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes composed of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based, expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly add various nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of Commercial Applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and they require a considerably larger capacity to support location-sensing, people-oriented, and context-aware operations.

– Evolution of Network Applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, a wealth of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, for example in astrophysics, oceanology, genomics, and environmental research. The U.S. National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets are highly varied in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.




6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, the management and analysis of which rely on mature commercialized technologies, such as RDBMS, data warehousing, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].
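
As a simple, concrete instance of statistical anomaly detection (not the specific method of [117]), the Python sketch below flags values that deviate from the sample mean by more than a chosen number of standard deviations (two, in this toy). The threshold and data are arbitrary illustrative assumptions.

import statistics

def zscore_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0]   # 42.0 is an injected outlier
print(zscore_anomalies(readings))                      # [42.0]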

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
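
As a concrete illustration of the bag-of-words representations that many text mining systems build on, the following minimal sketch computes TF-IDF weights for a toy corpus; the documents and whitespace tokenization are illustrative, and real systems would add the NLP steps listed above.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = ["big data needs new analysis methods".split(),
        "text analysis extracts knowledge from text".split()]
for w in tf_idf(docs):
    print(sorted(w.items(), key=lambda kv: -kv[1])[:3])
```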

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the parts of the Web being mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. The information retrieval method mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to support more complex queries than keyword-based search.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawlers are another successful case of utilizing such models [127].
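
PageRank itself can be summarized in a few lines. The following minimal sketch runs the standard power iteration on a toy link graph; the graph, damping factor, and iteration count are illustrative and not tied to the original implementation in [125].

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as
    {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:            # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(web))
```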

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and analysis is required to extract useful knowledge and understand the semantics it carries. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization aims to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo!, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
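
A static key-frame summarizer can be sketched very simply: select a new key frame whenever the current frame differs enough from the last selected one. The frame data and threshold below are illustrative, and production systems use far richer features than raw pixel differences.

```python
import numpy as np

def select_key_frames(frames, threshold=20.0):
    """Pick a frame as a key frame whenever it differs from the last
    selected key frame by more than `threshold` (mean absolute pixel
    difference) -- a simple static summarization heuristic."""
    key_frames = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) -
                      frames[key_frames[-1]].astype(float)).mean()
        if diff > threshold:
            key_frames.append(i)
    return key_frames

# Toy "video": ten 64x64 grayscale frames with a scene change at frame 5
rng = np.random.default_rng(0)
frames = [rng.integers(0, 50, (64, 64)) for _ in range(5)] + \
         [rng.integers(150, 255, (64, 64)) for _ in range(5)]
print(select_key_frames(frames, threshold=50.0))
```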

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. Based on the results of structural analysis, the second procedure is feature extraction, which mainly involves further mining the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation then utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and the retrieval result is refined through relevance feedback.
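
The query-and-retrieval step can be illustrated with a minimal similarity search over extracted feature vectors; the feature vectors and the cosine similarity measure below are illustrative choices, not the specific method of [131].

```python
import numpy as np

def retrieve(query_vec, index, top_k=3):
    """Rank indexed items by cosine similarity between their feature
    vectors and the query's feature vector."""
    names = list(index)
    matrix = np.array([index[n] for n in names], dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:top_k]
    return [(names[i], float(sims[i])) for i in order]

# Toy index: video id -> extracted feature vector (e.g., color/motion stats)
index = {"clip_a": [0.9, 0.1, 0.0], "clip_b": [0.2, 0.8, 0.1],
         "clip_c": [0.7, 0.3, 0.1]}
print(retrieve([1.0, 0.2, 0.0], index))
```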

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend other contents with similar features to users. These methods largely rely on content similarity measurement, but most of them suffer from limited analysis capability and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
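
A minimal sketch of the collaborative-filtering idea: score items a user has not seen by the ratings of users with similar rating vectors. The ratings and similarity measure are illustrative; deployed systems such as those in [132, 133] are considerably more elaborate.

```python
import numpy as np

def recommend(ratings, user, top_k=2):
    """User-based collaborative filtering: score unseen items by the
    ratings of users with similar rating vectors (cosine similarity)."""
    users = list(ratings)
    items = sorted({i for r in ratings.values() for i in r})
    m = np.array([[ratings[u].get(i, 0.0) for i in items] for u in users])
    target = users.index(user)
    norms = np.linalg.norm(m, axis=1)
    sims = m @ m[target] / (norms * norms[target] + 1e-9)
    sims[target] = 0.0                      # exclude the user themselves
    scores = sims @ m                       # weighted sum of neighbours' ratings
    unseen = [j for j, i in enumerate(items) if i not in ratings[user]]
    best = sorted(unseen, key=lambda j: -scores[j])[:top_k]
    return [items[j] for j in best]

ratings = {"alice": {"clip_a": 5, "clip_b": 3},
           "bob":   {"clip_a": 4, "clip_c": 5},
           "carol": {"clip_b": 2, "clip_c": 4, "clip_d": 5}}
print(recommend(ratings, "alice"))
```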

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for multimedia event detection using only a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis has evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges


and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in an SNS [140]. Linear algebra approaches compute the similarity between two vertexes based on a rank-reduced similarity matrix obtained by singular value decomposition [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
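
A minimal sketch of neighborhood-based link prediction: score unlinked vertex pairs by the overlap of their neighbor sets (the Jaccard coefficient). The toy graph is illustrative, and this is only one of the feature-based signals a classifier such as the one in [139] might use.

```python
def jaccard_link_scores(adjacency, candidate_pairs):
    """Score unlinked vertex pairs by the Jaccard coefficient of their
    neighbour sets; a higher score suggests a more likely future link."""
    scores = {}
    for u, v in candidate_pairs:
        nu, nv = adjacency[u], adjacency[v]
        union = nu | nv
        scores[(u, v)] = len(nu & nv) / len(union) if union else 0.0
    return scores

# Toy SNS: user -> set of friends
adjacency = {"a": {"b", "c"}, "b": {"a", "c", "d"},
             "c": {"a", "b"}, "d": {"b"}}
print(jaccard_link_scores(adjacency, [("a", "d"), ("c", "d")]))
```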

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on target functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning information, and comments. However, social media analysis

is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android application market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile data analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile data analysis has been started in different fields. Since this research has just begun, we only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) gather together on networks, meet to make a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.
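
The streaming flavor of such body-area monitoring can be sketched with a sliding window over sensor readings; the vital-sign thresholds, window size, and simulated heart-rate stream below are illustrative assumptions, not taken from [154] or [155].

```python
from collections import deque

class VitalSignMonitor:
    """Keep a sliding window of recent sensor readings and raise an
    alert when the window average leaves a safe range."""
    def __init__(self, window=5, low=55, high=100):
        self.readings = deque(maxlen=window)
        self.low, self.high = low, high

    def push(self, value):
        self.readings.append(value)
        avg = sum(self.readings) / len(self.readings)
        return None if self.low <= avg <= self.high else avg

monitor = VitalSignMonitor(window=3, low=55, high=100)
for beat in [72, 75, 78, 110, 120, 130]:   # simulated heart-rate stream
    alert = monitor.push(beat)
    if alert is not None:
        print(f"alert: average heart rate {alert:.1f} out of range")
```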

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes people's gait when they walk and uses the gait information to unlock the security system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in accelerometer of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chain management, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to close the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
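
The drop-out warning model described above is, at its core, a supervised churn classifier. The following minimal sketch trains a logistic regression on invented features (months inactive, score exchanges, average balance) and ranks customers by predicted drop-out risk; the features, data, and use of scikit-learn are assumptions for illustration, not CMB's actual system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: [months_inactive, score_exchanges, avg_balance_k]
# and whether the customer eventually dropped out (1) or stayed (0).
X = np.array([[6, 0, 2], [1, 5, 40], [4, 1, 8], [0, 9, 55],
              [7, 0, 1], [2, 3, 20], [5, 0, 3], [1, 6, 30]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score current customers and flag the ones most likely to drop out,
# so that retention offers (e.g., high-yield products) can target them.
customers = np.array([[5, 1, 4], [0, 7, 60], [3, 2, 10]])
risk = model.predict_proba(customers)[:, 1]
for row, p in zip(customers, risk):
    print(row, f"drop-out risk = {p:.2f}")
```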

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao, and the corresponding transaction times, commodity prices, and purchase quantities are recorded every day, and more importantly, so are the ages, genders, addresses, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so that the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and to optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project built on the cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills within a year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

An online SNS is a social structure constituted by social individuals and connections among individuals based on an


information network. Big data from online SNS mainly comes from instant messages, online social interactions, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, drawing on theories and methods from mathematics, informatics, sociology, and management science, etc., along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data from online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications. Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications. In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations,

is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data to predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth

Fig. 5 Enabling technologies for online social network-oriented big data


or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
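
Aspect 1), detecting sharp growth or drops in topic volume, can be sketched as a simple ratio test against a trailing average; the daily tweet counts, window, and factor below are illustrative, not Global Pulse's actual method.

```python
def detect_spikes(counts, window=7, factor=2.0):
    """Flag days whose topic volume grows or drops by more than
    `factor` times the trailing-window average."""
    events = []
    for day in range(window, len(counts)):
        baseline = sum(counts[day - window:day]) / window
        if baseline == 0:
            continue
        ratio = counts[day] / baseline
        if ratio >= factor or ratio <= 1.0 / factor:
            events.append((day, counts[day], round(baseline, 1)))
    return events

# Daily counts of tweets mentioning "rice price" (illustrative numbers)
counts = [40, 42, 38, 45, 41, 39, 44, 43, 120, 46, 41, 15, 40]
print(detect_spikes(counts))
```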

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing, complex data containing abundant and diverse information values. Big data has unlimited potential for

effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds of weight, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

Fig. 6 The correlation between Tweets about rice price and food price inflation


imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through a software development kit (SDK) and open interfaces.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules or employing professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operational framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecasted that spatial crowdsourcing will be more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
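
The task-assignment step of spatial crowdsourcing can be sketched as matching each location-tagged task to the nearest available worker; the coordinates, greedy strategy, and one-task-per-worker rule below are simplifying assumptions for illustration.

```python
import math

def assign_tasks(tasks, workers):
    """Greedily assign each spatial task to the closest still-available
    worker (Euclidean distance on (x, y) coordinates)."""
    available = dict(workers)          # worker id -> (x, y)
    assignment = {}
    for task_id, (tx, ty) in tasks.items():
        if not available:
            break
        best = min(available,
                   key=lambda w: math.hypot(available[w][0] - tx,
                                            available[w][1] - ty))
        assignment[task_id] = best
        del available[best]            # one task per worker in this sketch
    return assignment

tasks = {"photo_bridge": (2.0, 3.0), "noise_sample": (8.0, 1.0)}
workers = {"w1": (1.5, 2.5), "w2": (7.0, 1.5), "w3": (5.0, 5.0)}
print(assign_tasks(tasks, workers))
```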

6.3.6 Smart grid

Smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control, for the optimized generation,

supply, and consumption of electric energy. Smart-grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the smart grid, the regions that have excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has successfully deployed smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price


according to the peak and off-peak periods of power consumption. TXU Energy utilizes such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users. (A minimal sketch of such peak/off-peak analysis is given after this list.)

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
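
To illustrate the time-sharing pricing mentioned in the second item, the following minimal sketch aggregates 15-minute smart-meter readings into peak and off-peak totals and prices them with a two-tier tariff; the peak hours, prices, and readings are illustrative assumptions, not TXU Energy's actual tariff.

```python
from collections import defaultdict

def peak_offpeak_profile(readings, peak_hours=range(17, 22)):
    """Aggregate 15-minute smart-meter readings (hour, kWh) into peak
    and off-peak totals, the basis for time-sharing pricing."""
    totals = defaultdict(float)
    for hour, kwh in readings:
        bucket = "peak" if hour in peak_hours else "off-peak"
        totals[bucket] += kwh
    return dict(totals)

def bill(profile, peak_price=0.30, offpeak_price=0.12):
    """Price consumption with a simple two-tier tariff."""
    return (profile.get("peak", 0.0) * peak_price +
            profile.get("off-peak", 0.0) * offpeak_price)

# Illustrative one-day readings: (hour of day, energy used in that 15-min slot)
readings = [(2, 0.2), (2, 0.2), (9, 0.4), (18, 1.1), (18, 1.2), (20, 0.9)]
profile = peak_offpeak_profile(readings)
print(profile, "bill =", round(bill(profile), 2))
```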

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. We then focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions for big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard or benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after a system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions before and after implementation. In addition, since data quality is an important basis for data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor for improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems of big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; and (iii) data exhaust, which refers to the wrong or left-over data produced during acquisition. In big data, not only the correct data but also the erroneous data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the

individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown to be not applicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of this. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed to be the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data


quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanisms: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We could

not predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, a globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data as a resource: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross-fusion of sciences: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross-fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented design: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers that promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract

increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 116 Yuki N (2011) Following digital breadcrumbs to big data gold

httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13), ACM

30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Trans Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Pract Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade: are you ready? External publication of IDC (Analyze the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48, 2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management, ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks, IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems, ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 international conference on (IPSN'08), IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems, ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007 6th international symposium on (IPSN 2007), IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks, IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems, ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web, ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009 35th European conference on (ECOC'09), IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synth Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39, ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM computer communication review, vol 40, ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems, ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks, ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ, Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management, ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008 IEEE 24th international conference on (ICDE 2008), IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Trans Multimed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Trans Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Trans Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing, ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation, USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing, ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media, Inc.

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc.

93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc.

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc.

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications, Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proc VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008 IEEE international symposium on (IPDPS 2008), IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 135–146

105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing, ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special report on personal technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance, ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers, ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM SIGMOD Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inf Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Trans Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the World-Wide Web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval, ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia, ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web, ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis, ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining, ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval, ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium, ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



operates in the upper level supported by cloud computing and provides functions similar to those of a database and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

The evolution of big data was driven by the rapid growth of application demands, while cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data; the two supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured feature, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT hinges upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data already lags behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT


2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern "data" rather than "center": a data center has masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, and at present is the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to data centers. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers keeps expanding, how to reduce the operational cost of data center development is also an important issue.

– Big data endows the data center with more functions. In the big data paradigm, the data center shall not only be concerned with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering, etc. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.
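To make the clickstream-analysis use case above concrete, the following minimal sketch shows how such a job is typically expressed with Hadoop Streaming, which lets the mapper and reducer be written as ordinary scripts that read stdin and write stdout. The tab-separated record layout (user, URL, timestamp) and the file names are illustrative assumptions, not details of the deployments described above.

  # mapper.py -- emit (url, 1) for every click record read from stdin
  import sys
  for line in sys.stdin:
      fields = line.rstrip("\n").split("\t")
      if len(fields) >= 2:                      # assumed layout: user \t url \t timestamp
          print(fields[1] + "\t1")

  # reducer.py -- Hadoop sorts mapper output by key, so equal URLs arrive together
  import sys
  current_url, count = None, 0
  for line in sys.stdin:
      url, value = line.rstrip("\n").split("\t")
      if url != current_url and current_url is not None:
          print(current_url + "\t" + str(count))  # flush the finished key
          count = 0
      current_url = url
      count += int(value)
  if current_url is not None:
      print(current_url + "\t" + str(count))

The two scripts would be submitted with the hadoop-streaming jar shipped with a Hadoop release, e.g., hadoop jar hadoop-streaming-*.jar -mapper mapper.py -reducer reducer.py -input /logs -output /clicks-per-url; the framework handles input splitting, shuffling, and fault tolerance.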

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine into data-driven science. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ, to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of search entries, Internet forum posts, chatting records, and microblog messages is generated. Those data are closely related to people's daily life, and share the features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close-range transmission may rely on sensor networks and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment is distributedly deployed, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT is characterized by large scale.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.
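As a toy illustration of the last point, the sketch below keeps only the readings that deviate strongly from a sliding-window average and discards the "normal flow" of measurements. The window size and threshold are arbitrary illustrative choices, not values taken from any system discussed in this survey.

  from collections import deque

  def keep_effective(readings, window=30, k=3.0):
      # readings: iterable of (timestamp, device_id, value) tuples
      history = deque(maxlen=window)
      for timestamp, device_id, value in readings:
          if len(history) == window:
              mean = sum(history) / window
              var = sum((x - mean) ** 2 for x in history) / window
              if var > 0 and abs(value - mean) > k * var ** 0.5:
                  yield (timestamp, device_id, value)   # abnormal, hence valuable
          history.append(value)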

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, the frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rises quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet". IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167TB to 665TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although they are in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night will surpass 20TB. The last application is related to high-energy physics. In the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2 PB/s and stores about 10TB of processed data per year.

In addition, pervasive sensing and computing across nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases the storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: as one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly include the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log files based on data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., the video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled at different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: at present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. A web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded web page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.
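A minimal, hedged sketch of the URL-queue loop just described is given below (single-threaded and breadth-first; a production crawler would add politeness rules, robots.txt handling, parallel fetching, and a proper HTML parser):

  import re
  from collections import deque
  from urllib.parse import urljoin
  from urllib.request import urlopen

  def crawl(seed_url, max_pages=100):
      queue, seen, pages = deque([seed_url]), {seed_url}, {}
      while queue and len(pages) < max_pages:
          url = queue.popleft()                       # take the next URL in order of precedence
          try:
              html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
          except Exception:
              continue                                # unreachable pages are simply skipped
          pages[url] = html                           # store the downloaded page
          for link in re.findall(r'href="([^"]+)"', html):
              new_url = urljoin(url, link)            # extract new URLs from the page
              if new_url.startswith("http") and new_url not in seen:
                  seen.add(new_url)
                  queue.append(new_url)               # put them in the queue and repeat
      return pages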

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology and zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used. (A minimal capture sketch, for contrast, is given after this list.)


– Zero-copy packet capture technology: the so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at the external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

– Mobile equipment: at present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy": it may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
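For contrast with the zero-copy designs above, the sketch below captures frames with an ordinary AF_PACKET raw socket on Linux (root privileges required), where every recvfrom() call copies a packet from kernel to user space; genuinely zero-copy capture instead relies on kernel or driver support such as memory-mapped ring buffers. The frame count is an arbitrary illustrative choice.

  import socket, struct

  ETH_P_ALL = 0x0003                                   # ask the kernel for every protocol
  sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(ETH_P_ALL))
  for _ in range(10):                                  # read ten frames, then stop
      frame, _addr = sock.recvfrom(65535)              # one kernel-to-user copy per packet
      dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
      print(src.hex(":"), "->", dst.hex(":"), hex(ethertype))
  sock.close()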

In addition to the aforementioned three data acquisition methods for main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources, and collection methods recording through other auxiliary tools.

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or TB/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TOR), and then such top-of-rack switches are connected with 10 Gbps aggregation switches in the topological structure. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such layer is constituted by 10 Gbps or 100 Gbps core switches to connect the aggregation switches in the topological structure. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches, using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
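The bandwidth mismatch that such two-layer topologies must manage can be seen with a back-of-the-envelope calculation; the port counts below are illustrative assumptions, not figures reported in the works cited above.

  servers_per_rack = 40          # hosts on one 1 Gbps top-of-rack (TOR) switch
  server_link_gbps = 1
  uplinks_per_rack = 2           # 10 Gbps links from the TOR to the aggregation layer
  uplink_gbps = 10

  offered = servers_per_rack * server_link_gbps        # 40 Gbps the servers can generate
  available = uplinks_per_rack * uplink_gbps           # 20 Gbps toward the rest of the fabric
  print("oversubscription = %.1f : 1" % (offered / available))   # prints 2.0 : 1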

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting source systems, and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; on the contrary, it includes information or metadata related to the actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

ndash Cleaning data cleaning is a process to identify inac-curate incomplete or unreasonable data and thenmodify or delete such data to improve data qualityGenerally data cleaning includes five complementaryprocedures [68] defining and determining error typessearching and identifying errors correcting errors doc-umenting error examples and error types and mod-ifying data entry procedures to reduce future errorsDuring cleaning data formats completeness rational-ity and restriction shall be inspected Data cleaning isof vital importance to keep the data consistency whichis widely applied in many fields such as banking insur-ance retail industry telecommunications and trafficcontrol

In e-commerce most data is electronically col-lected which may have serious data quality prob-lems Classic data quality problems mainly come fromsoftware defects customized errors or system mis-configuration Authors in [69] discussed data cleaning

Mobile Netw Appl (2014) 19171ndash209 183

in e-commerce by crawlers and regularly re-copyingcustomer and account information

In [70] the problem of cleaning RFID data wasexamined RFID is widely used in many applica-tions eg inventory management and target track-ing However the original RFID features low qualitywhich includes a lot of abnormal data limited by thephysical design and affected by environmental noisesIn [71] a probability model was developed to copewith data loss in mobile environments Khoussainovaet al in [72] proposed a system to automatically cor-rect errors of input data by defining global integrityconstraints

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, leading to data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause an additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one already listed in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of the variety of datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
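The hash-based repeated data deletion described above can be illustrated with a short sketch. The following Python example is only a minimal, single-machine illustration of the idea; the block size and hash function are arbitrary choices for demonstration, not the design of any particular deduplication system.

```python
import hashlib

def deduplicate(stream, block_size=4096):
    """Split a byte stream into fixed-size blocks and store each unique
    block only once; repeated blocks are kept only as identifier references."""
    store = {}          # identifier -> stored block (the identification list)
    layout = []         # sequence of identifiers that reconstructs the stream
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        ident = hashlib.sha256(block).hexdigest()   # identifier via a hash algorithm
        if ident not in store:                      # new block: store it once
            store[ident] = block
        layout.append(ident)                        # repeated block: only a reference
    return store, layout

def reconstruct(store, layout):
    """Rebuild the original stream from the unique blocks and the identifier list."""
    return b"".join(store[ident] for ident in layout)

if __name__ == "__main__":
    data = b"abcd" * 2048 + b"wxyz" * 1024           # highly redundant sample data
    store, layout = deduplicate(data)
    assert reconstruct(store, layout) == data
    print(f"{len(layout)} blocks referenced, {len(store)} unique blocks stored")
```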

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for the query and analysis of a large amount of data.

Traditionally, as the auxiliary equipment of a server, a data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to remain competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within the storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network sub-systems, which provide connection among one or more disc arrays and servers; and (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system were not seriously affected, so as to satisfy customers' requests in terms of reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable if the distributed storage still worked well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system cannot simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. AP systems are different from CP systems in that they additionally ensure availability, but only ensure eventual consistency rather than the strong consistency of the previous two types of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Service (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for the storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source code of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges in categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, a simple API, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The rigid schema of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partitioning, data replication, and object versioning mechanisms. The Dynamo partitioning plan relies on Consistent Hashing [86], whose main advantage is that node arrival or departure only affects directly adjacent nodes and does not affect other nodes, to divide the load among multiple main storage machines (a short sketch of consistent hashing is given after this group of key-value stores). Dynamo copies data to N sets of servers, where N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the update and any other operation, the update operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. Notably, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.
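To make the consistent hashing idea mentioned for Dynamo concrete, the following is a minimal Python sketch of a hash ring with virtual nodes; the hash function, number of virtual nodes, and server names are illustrative assumptions, not details of Dynamo or of any other system discussed above.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []        # sorted list of (hash position, node name)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node is mapped to many ring positions,
        # which smooths the load when nodes join or leave.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for (h, n) in self._ring if n != node]

    def get_node(self, key):
        # Walk clockwise from the key's position to the first node found.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, chr(0x10FFFF)))
        return self._ring[idx % len(self._ring)][1]

if __name__ == "__main__":
    ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
    print(ring.get_node("user:42"))    # the key is routed to one of the servers
    ring.remove_node("server-b")       # only keys owned by server-b are remapped
    print(ring.get_node("user:42"))
```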

– Column-oriented databases: column-oriented databases store and process data by columns rather than by rows. Both columns and rows are segmented across multiple nodes to achieve expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed structured data storage system, designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional sequenced mapping with sparse, distributed, and persistent storage. The indexes of the mapping are the row key, column key, and timestamp, and every value in the mapping is an uninterpreted byte array (a toy sketch of this data model is given after this discussion of column-oriented databases). Each row key in BigTable is a character string of up to 64KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly effective, since it only involves communication with a small portion of machines. Columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in descending order of timestamps, so the latest edition will always be read first.

The BigTable API features the creation and deletion of Tablets and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable includes three main components: the Master server, Tablet servers, and the client library. BigTable only allows one Master server to be distributed, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can modify the BigTable schema, e.g., creating tables and column families, collect garbage saved in GFS, as well as deleted or disabled files, and use them in specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of the loaded Tablets. When Tablets become too big, they will be segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a mapping between persistent, sequenced, and immutable keys and values, both as arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control table.

– Cassandra: Cassandra is a distributed storage system for managing the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed with Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. The row operations are atomic operations, equipped with row-level locking and transaction processing, which is optional for large-scale use. Partitioning and distribution are transparently operated and have space for client hashing or fixed keys.

HyperTable was developed similarly to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those in BigTable. HyperTable has its own query language, called the HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.
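The BigTable-style data model described above, a sparse, sorted, multi-dimensional map from (row key, column, timestamp) to an uninterpreted value, can be imitated in a few lines. The following toy Python sketch is only an illustration of that model under simple assumptions (an in-memory dictionary, nanosecond timestamps); it is not how BigTable or its derivatives are implemented.

```python
import bisect
import time

class SparseSortedMap:
    """Toy illustration of a BigTable-style data model: a sparse map from
    (row key, column family:qualifier, timestamp) to an uninterpreted value,
    with rows kept in lexicographical order."""

    def __init__(self):
        self._rows = {}        # row key -> {column -> [(timestamp, value), ...]}
        self._row_keys = []    # sorted row keys, enabling scans over contiguous rows

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time_ns()
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        cells = self._rows[row].setdefault(column, [])
        cells.append((ts, value))
        cells.sort(reverse=True)            # newest edition first

    def get(self, row, column):
        """Return the latest edition of a cell, as reads do by default."""
        cells = self._rows.get(row, {}).get(column, [])
        return cells[0][1] if cells else None

    def scan(self, start_row, end_row):
        """Range scan over lexicographically contiguous rows (a Tablet-like slice)."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        return [(r, self._rows[r]) for r in self._row_keys[lo:hi]]

if __name__ == "__main__":
    t = SparseSortedMap()
    t.put("com.example/index", "anchor:home", b"Example")
    t.put("com.example/index", "contents:", b"<html>v1</html>")
    t.put("com.example/index", "contents:", b"<html>v2</html>")
    print(t.get("com.example/index", "contents:"))   # latest edition: v2
```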

– Document databases: compared with key-value stores, document stores can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. A query in MongoDB is expressed with a syntax similar to JSON; a database driver sends the query as a BSON object to MongoDB. The system allows queries over all documents, including embedded objects and arrays (see the short example after this group of document stores). To enable rapid queries, indexes can be created on the queryable fields of documents. The replication operation in MongoDB is executed with log files on the main nodes that record all the high-level operations conducted in the database. During replication, the slave nodes query all the writing operations since the last synchronization from the master and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB assures eventual consistency but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, the identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
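To give a concrete feel for the document stores above, the following minimal sketch uses the pymongo driver to insert and query BSON-style documents. It assumes a MongoDB server reachable at localhost and a hypothetical "catalog" database, so the names and fields are illustrative only, not part of the systems discussed.

```python
from pymongo import MongoClient, ASCENDING

# Connect to an (assumed) local MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
db = client["catalog"]                     # hypothetical database name

# Documents are schema-free; embedded objects and arrays are allowed.
db.products.insert_one({
    "name": "sensor-node",
    "price": 19.5,
    "tags": ["iot", "hardware"],
    "specs": {"cpu_mhz": 80, "ram_kb": 256},
})

# An index on a queryable field speeds up lookups.
db.products.create_index([("price", ASCENDING)])

# Queries are expressed as JSON-like documents, including on embedded fields.
for doc in db.products.find({"price": {"$lt": 50}, "specs.ram_kb": {"$gte": 128}}):
    print(doc["name"], doc["price"])
```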

Big data is generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication: the user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft. A minimal word-count sketch of the bare Map/Reduce model is given below.
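The following is a single-machine sketch of the Map/Reduce contract using a word-count example; it only imitates the programming model (map, shuffle by key, reduce) in plain Python and is not an implementation of, or interface to, any particular framework mentioned above.

```python
from collections import defaultdict

def map_fn(_, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input value."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: compress the list of values for one key into a smaller result."""
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase.
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)      # shuffle: group values by intermediate key
    # Reduce phase.
    results = {}
    for k, vs in intermediate.items():
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

if __name__ == "__main__":
    docs = [("doc1", "big data needs big storage"), ("doc2", "big data analysis")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'analysis': 1}
```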

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via the data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logical operation graph are automatically mapped to physical resources.

The operational structure of Dryad is coordinated by a central program called the job manager, which can be executed on a cluster or a workstation through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange the available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A against all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: the Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., the analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable and user-defined value, and every directed edge related to a source vertex is constituted by the user-defined value and the identifier of the target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express the given algorithm logic. Every vertex may modify the status of itself and of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may be deactivated by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed.

The output of a Pregel program is the set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
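To make the vertex-centric superstep model concrete, here is a tiny, single-machine sketch in the Pregel style, where every vertex repeatedly adopts the maximum value it has seen and forwards changes to its out-neighbors until no messages remain. The graph, values, and superstep limit are made-up illustrative assumptions, and no real Pregel API is used.

```python
def pregel_max(graph, values, max_supersteps=30):
    """Single-machine sketch of Pregel-style supersteps: every vertex keeps the
    maximum value it has seen and forwards changes to its out-neighbors.
    graph: vertex -> list of out-neighbors; values: vertex -> initial value."""
    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])

    for _ in range(max_supersteps):
        if not any(inbox.values()):          # no messages left: the program completes
            break
        next_inbox = {v: [] for v in graph}
        for v in graph:
            if not inbox[v]:
                continue                     # vertex is inactive in this superstep
            best = max(inbox[v])
            if best > values[v]:             # value changed: stay active and notify
                values[v] = best
                for n in graph[v]:
                    next_inbox[n].append(best)
        inbox = next_inbox                   # global synchronization point
    return values

if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c", "a"], "c": ["a"], "d": ["a"]}
    print(pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1}))
    # every vertex reachable from the maximum converges to 6
```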

Inspired by the above programming models, other researchers have also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

– Cluster analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity, while different categories have high heterogeneity. Cluster analysis is an unsupervised method that requires no training data (a minimal clustering sketch is given after this list of methods).

– Factor analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, and then using the few factors to reveal most of the information in the original data.



– Correlation analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, some undetermined or inexact dependence relations, in which the numerical value of one variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into something simple and regular.

– A/B testing: also called bucket testing, it is a technology for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description of and an inference about big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
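As an illustration of the clustering methods mentioned above, the following is a minimal k-means sketch in plain Python for one-dimensional data. The sample data, the number of clusters, and the iteration limit are arbitrary choices for demonstration, not recommendations from the survey.

```python
import random

def kmeans_1d(points, k=2, iterations=20, seed=0):
    """Minimal k-means on 1-D data: assign each point to the nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 9.7, 10.1, 10.4, 9.9]
    centroids, clusters = kmeans_1d(data, k=2)
    print(centroids)   # roughly [1.0, 10.0]: two homogeneous groups
```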

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods for big data are as follows.

– Bloom filter: a Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store the hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy-compression storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages in misrecognition and deletion (a short sketch is given at the end of this subsection).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, indexing has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem and assign the pieces to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).

Although the parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
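The Bloom filter mentioned at the start of this subsection can be sketched in a few lines. The bit-array size and number of hash functions below are arbitrary illustrative parameters; the sketch trades a small false-positive probability for space, and does not support deletion, exactly as described.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: stores hash positions of items in a bit array
    instead of the items themselves, so membership tests may yield false
    positives but never false negatives, and items cannot be deleted."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

if __name__ == "__main__":
    bf = BloomFilter()
    for url in ("http://a.example", "http://b.example"):
        bf.add(url)
    print(bf.might_contain("http://a.example"))   # True
    print(bf.might_contain("http://c.example"))   # probably False (may be a false positive)
```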

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

- Deployment: MPI arranges computing nodes and data storage separately (data must be moved to the computing nodes), while MapReduce and Dryad arrange computing and data storage at the same node (computing is moved close to the data).
- Resource management and scheduling: not provided in MPI; Workqueue (Google) and HOD (Yahoo) for MapReduce; not clear for Dryad.
- Low-level programming: the MPI API; the MapReduce API; the Dryad API.
- High-level programming: none for MPI; Pig, Hive, Jaql, etc. for MapReduce; Scope and DryadLINQ for Dryad.
- Data storage: the local file system, NFS, etc. for MPI; GFS (Google), HDFS (Hadoop), KFS, Amazon S3, etc. for MapReduce; NTFS and Cosmos DFS for Dryad.
- Task partitioning: users manually partition the tasks in MPI; automatic in MapReduce and Dryad.
- Communication: messaging and remote memory access in MPI; files (local FS, DFS) in MapReduce; files, TCP pipes, and shared-memory FIFOs in Dryad.
- Fault tolerance: checkpointing in MPI; task re-execution in MapReduce and Dryad.

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since the data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory, so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Currently, mainstream BI products are provided with data analysis plans supporting scales over the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software tools, according to a survey of 798 professionals, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?", conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software tool used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked first). The data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL) data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. The functions of RapidMiner are implemented by connecting processes consisting of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME; developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software tool written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of commercial applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. The analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of network applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, a plentiful supply of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of scientific applications: scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operable analysis software, and data services to assist researchers, educators, and students in enriching the plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to provide comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning, based on exact mathematical models and powerful algorithms, has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing such models [127].
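The link-structure idea behind PageRank can be shown with a tiny power-iteration sketch. The toy link graph, damping factor, and iteration count below are standard illustrative choices, not the specifics of the cited systems.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns a score per page that
    reflects how likely a random surfer is to land on it."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                       # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_web = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"]}
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```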

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data features increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by exploiting the different preferences of users.
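As a small illustration of one preprocessing step in Web usage mining, the sketch below groups raw access-log records into per-user sessions using a 30-minute inactivity threshold; the record format and the threshold are illustrative assumptions.

```python
# A minimal sessionization sketch for Web access logs; the log format
# and the 30-minute gap are illustrative assumptions.
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(log_records):
    """log_records: iterable of (user_id, iso_timestamp_string, url)."""
    by_user = defaultdict(list)
    for user, ts, url in log_records:
        by_user[user].append((datetime.fromisoformat(ts), url))
    sessions = []
    for user, visits in by_user.items():
        visits.sort()
        current, last_time = [], None
        for ts, url in visits:
            if last_time is not None and ts - last_time > SESSION_GAP:
                sessions.append((user, current))
                current = []
            current.append(url)
            last_time = ts
        sessions.append((user, current))
    return sessions

if __name__ == "__main__":
    log = [
        ("u1", "2013-05-01T10:00:00", "/home"),
        ("u1", "2013-05-01T10:05:00", "/cart"),
        ("u1", "2013-05-01T12:00:00", "/home"),   # new session after a long gap
        ("u2", "2013-05-01T10:02:00", "/search"),
    ]
    for user, pages in sessionize(log):
        print(user, pages)
```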


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and useful knowledge can be extracted and the semantics understood through analysis. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting such information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo!, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
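To make the static summarization idea concrete, the following minimal sketch selects key frames whenever a frame differs sufficiently from the last selected key frame; frames are modeled as flat lists of gray-level values and the threshold is an illustrative assumption, whereas real systems decode video and use richer features.

```python
# A minimal key-frame selection sketch for static video summarization.
# Frames are modeled as flat lists of gray levels; the threshold is an
# illustrative assumption.
def mean_abs_diff(frame_a, frame_b):
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def select_key_frames(frames, threshold=20.0):
    """Return indices of frames chosen to represent the video."""
    if not frames:
        return []
    key_frames = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[key_frames[-1]]) > threshold:
            key_frames.append(i)
    return key_frames

if __name__ == "__main__":
    # Three near-identical frames, then an abrupt scene change.
    frames = [[10] * 64, [12] * 64, [11] * 64, [200] * 64, [205] * 64]
    print(select_key_frames(frames))   # -> [0, 3]
```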

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of the content they are interested in, and recommend other contents with similar features to users. These methods rely largely on content similarity measurement, but most of them are troubled by limited analysis capability and over-specialization. Collaborative-filtering-based methods identify groups of users with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
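The collaborative-filtering idea above can be illustrated with a minimal user-based sketch: users with similar past ratings are used to score items the target user has not seen. The toy rating data and the choice of cosine similarity are illustrative assumptions.

```python
# A minimal user-based collaborative-filtering sketch; the toy ratings
# and the cosine similarity choice are illustrative assumptions.
import math

def cosine(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm

def recommend(ratings, target, top_k=2):
    """ratings: {user: {item: score}}; returns items the target has not rated."""
    scores, weights = {}, {}
    for other, items in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], items)
        for item, score in items.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * score
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((s / weights[i], i) for i, s in scores.items() if weights[i] > 0), reverse=True)
    return [item for _, item in ranked[:top_k]]

if __name__ == "__main__":
    ratings = {
        "alice": {"camera": 5, "phone": 3},
        "bob":   {"camera": 4, "phone": 3, "tablet": 5},
        "carol": {"phone": 2, "laptop": 4},
    }
    print(recommend(ratings, "alice"))
```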

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains a text description of the related concept and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis has evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges


and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since an SNS is a dynamic network, new vertexes and edges are continually added to the graph. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex pair and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in an SNS [140]. Linear algebra methods compute the similarity between two vertexes from a rank-reduced similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph feature high density, while the edges between sub-graphs feature much lower density [142].
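A minimal illustration of the feature-based link prediction approach mentioned above is given below: unconnected vertex pairs are ranked by the Jaccard similarity of their neighbor sets, one of the simplest structural features; the toy friendship graph is an illustrative assumption.

```python
# A minimal link-prediction sketch using the Jaccard similarity of
# neighbor sets; the toy graph is an illustrative assumption.
from itertools import combinations

def jaccard(neighbors, a, b):
    na, nb = neighbors[a], neighbors[b]
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

def predict_links(edges):
    """Rank all currently unconnected vertex pairs by Jaccard score."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    existing = {frozenset(e) for e in edges}
    candidates = [
        (jaccard(neighbors, a, b), a, b)
        for a, b in combinations(sorted(neighbors), 2)
        if frozenset((a, b)) not in existing
    ]
    return sorted(candidates, reverse=True)

if __name__ == "__main__":
    friendships = [("ann", "bea"), ("ann", "cal"), ("bea", "dan"), ("cal", "dan")]
    for score, a, b in predict_links(friendships):
        print(f"{a}-{b}: {score:.2f}")
```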

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].
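A simple topology-based heuristic in this family is label propagation, sketched below: every vertex repeatedly adopts the label most common among its neighbors, so densely connected groups converge to a shared label. The toy graph and the fixed iteration count are illustrative assumptions.

```python
# A minimal label-propagation sketch for community detection; the toy
# graph and the fixed number of iterations are illustrative assumptions.
from collections import Counter

def label_propagation(adjacency, iterations=10):
    labels = {v: v for v in adjacency}          # each vertex starts alone
    for _ in range(iterations):
        for vertex, neighbors in adjacency.items():
            if not neighbors:
                continue
            counts = Counter(labels[n] for n in neighbors)
            labels[vertex] = counts.most_common(1)[0][0]
    return labels

if __name__ == "__main__":
    graph = {
        "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "f"],   # one community
        "d": ["e", "f"], "e": ["d", "f"], "f": ["d", "e", "c"],   # another community
    }
    print(label_propagation(graph))
```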

Social influence refers to the case where individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in an SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive amounts of information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and such communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment) gather together on a network, meet to pursue a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjøvik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones which analyzes people's paces when they walk and uses the pace information to unlock a security system [11]. In the meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to close the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the ages, genders, addresses, and even hobbies and interests of the buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year owing to the timely identification and repair of water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

An online SNS is a social structure constituted by social individuals and the connections among individuals based on an


information network. Big data of online SNS mainly comes from instant messages, online social micro-blogs, and shared spaces, etc., and represents various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Figure 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth

Fig. 5 Enabling technologies for online social network-oriented big data


or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogues on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
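The first of these aspects can be illustrated with a minimal sketch that flags days on which the volume of a topic grows or drops sharply relative to its recent history, using a rolling z-score; the toy counts, window size, and threshold are illustrative assumptions and not the Global Pulse methodology.

```python
# A minimal rolling z-score sketch for flagging abnormal topic volume;
# the toy counts, window, and threshold are illustrative assumptions.
import math

def abnormal_days(daily_counts, window=7, z_threshold=3.0):
    flags = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mean = sum(history) / window
        var = sum((x - mean) ** 2 for x in history) / window
        std = math.sqrt(var) or 1.0          # avoid division by zero
        z = (daily_counts[i] - mean) / std
        if abs(z) > z_threshold:
            flags.append((i, daily_counts[i], round(z, 1)))
    return flags

if __name__ == "__main__":
    tweets_per_day = [120, 130, 118, 125, 140, 128, 135, 131, 129, 620, 133]
    print(abnormal_days(tweets_per_day))     # day 9 stands out
```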

Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and master the laws of social and economic activities in the following three aspects:

– Early warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for

effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of the metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

Fig. 6 The correlation between Tweets about rice price and food price inflation


imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through a software development kit (SDK) and an open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need to intentionally deploy sensing modules or employ professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecast that spatial crowdsourcing will be more prevalent than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
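The workflow described above can be illustrated with a minimal sketch that offers each location-based task to the nearest willing worker within a travel radius; the coordinates, the radius, and the greedy one-task-per-worker policy are illustrative assumptions rather than a production assignment scheme.

```python
# A minimal spatial-crowdsourcing assignment sketch; coordinates, radius,
# and the greedy policy are illustrative assumptions.
import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def assign_tasks(tasks, workers, max_radius=5.0):
    """tasks: {task_id: (x, y)}; workers: {worker_id: (x, y)}."""
    assignments = {}
    free_workers = dict(workers)
    for task_id, task_pos in tasks.items():
        candidates = [
            (distance(task_pos, pos), worker_id)
            for worker_id, pos in free_workers.items()
            if distance(task_pos, pos) <= max_radius
        ]
        if candidates:
            _, chosen = min(candidates)
            assignments[task_id] = chosen
            del free_workers[chosen]         # one task per worker in this sketch
    return assignments

if __name__ == "__main__":
    tasks = {"photo_bridge": (1.0, 1.0), "noise_park": (8.0, 2.0)}
    workers = {"w1": (0.5, 1.5), "w2": (7.0, 3.0), "w3": (20.0, 20.0)}
    print(assign_tasks(tasks, workers))      # w3 is too far away for any task
```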

6.3.6 Smart grid

Smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-grid-related big data is generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which is measured by the phasor measurement units (PMU) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the smart grid, the regions that have excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at a given moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed on the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) is frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price


according to the peak and low periods of power consumption. TXU Energy has utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users; a minimal pricing sketch is given after this list.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
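To make the time-sharing dynamic pricing idea from the list above concrete, the minimal sketch below aggregates 15-minute smart-meter readings into hourly loads and bills hours above a load threshold at a peak tariff; the readings, the threshold, and the two-level tariff are illustrative assumptions.

```python
# A minimal time-of-use billing sketch over 15-minute smart-meter readings;
# tariffs, threshold, and readings are illustrative assumptions.
from collections import defaultdict

PEAK_PRICE, OFF_PEAK_PRICE = 0.30, 0.12      # currency units per kWh
PEAK_THRESHOLD_KWH = 2.0                     # hourly load per meter

def bill(readings):
    """readings: list of (hour, kwh) for the 15-minute intervals of one day."""
    hourly = defaultdict(float)
    for hour, kwh in readings:
        hourly[hour] += kwh
    total = 0.0
    for hour, load in sorted(hourly.items()):
        price = PEAK_PRICE if load > PEAK_THRESHOLD_KWH else OFF_PEAK_PRICE
        total += load * price
    return round(total, 2)

if __name__ == "__main__":
    # Four 15-minute readings per hour for three example hours.
    day = [(8, 0.4), (8, 0.5), (8, 0.4), (8, 0.5),
           (19, 0.9), (19, 1.0), (19, 0.8), (19, 0.9),   # evening peak
           (23, 0.2), (23, 0.2), (23, 0.1), (23, 0.2)]
    print(bill(day))
```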

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of the display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions for big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when a system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which refers to the wrong data collected during acquisition; in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of this. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed to be the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, acquired data from the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data


quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanisms: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and the advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and will be replaced given the rapid development of big data. The theoretical basis of Hadoop had emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by the globally-distributed database Spanner of Google and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we are willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers that promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, the analysis of big data in SNS, and other applications closely related to human activities based on big data will be of increasing concern and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12
2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf
3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper
4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0
5. Lohr S (2012) The age of big data. New York Times, pp 11
6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data
7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data
8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html
9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data
10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute
11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt
12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note, 6 February
13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media
14. Meijer E (2011) The world according to LINQ. Commun ACM 54(10):45–51
15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner, http://www.gartner.com/it/page.jsp
16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media
17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/
18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014
19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present and future. UCI ISG lecture series on scalable data management
21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43
22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery
24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81
25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27
26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033
27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98
28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States
29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM
30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy
31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Trans Parallel Distrib Syst 23(10):1831–1843
32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Pract Experience 23(17):2338–2354
33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16
34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33
35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48, 2008
36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68
37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180
38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729
39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) LUSTER: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems. ACM, pp 103–116
40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 international conference on IPSN'08. IEEE, pp 332–343
41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems. ACM, pp 309–322
42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007 6th international symposium on IPSN 2007. IEEE, pp 254–263
43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks. IEEE Computer Society, pp 277–288
44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63
45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687
46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135
47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160
48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84
49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142
50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009 35th European conference on ECOC'09. IEEE, pp 1–4
51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synth Lect Comput Archit 4(1):1–108
52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204
53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577
54. Cisco data center interconnect design and deployment guide (2010)
55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM Computer Communication Review, vol 39. ACM, pp 51–62
56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74
57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350
58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62
59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82 McKusick MK Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc
91 Judd D (2008) Hypertable-0.9.0.4-alpha
92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc
93 Crockford D (2006) The application/json media type for javascript object notation (json)
94 Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS and SimpleDB. O'Reilly Media Inc
95 Anderson JC Lehnardt J Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc
96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Howe B Balazinska M Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special report on personal technology (2011)
116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data but also undertakes more responsibilities, such as acquiring, managing, and organizing data, and leveraging the data values and functions. A data center is mainly concerned with "data" rather than "center": it holds masses of data and organizes and manages them according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings sound development opportunities, as well as great challenges, to data centers. Big data is an emerging paradigm that will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but at present it is also the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm places more stringent requirements on storage capacity and processing capacity, as well as on network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data at a limited price/performance ratio. The data center shall provide infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their own unique architectures and directly promote the development of storage, network, and computing technologies related to the data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers keeps expanding, how to reduce the operational cost of data center development becomes an important issue.

– Big data endows the data center with more functions. In the big data paradigm, the data center shall not only concern itself with hardware facilities but also strengthen its soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, its biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that its Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide commercial Hadoop execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.
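To make the programming model behind these deployments concrete, the following is a minimal sketch of a word-count job written for Hadoop Streaming, which lets mappers and reducers be plain scripts that read from standard input and write to standard output. The file name, job parameters, and the invocation shown in the comment are illustrative assumptions, not details taken from the deployments described above.

```python
# job.py -- a word-count job for Hadoop Streaming; run with "map" or "reduce".
import sys

def run_mapper():
    # Emit one "word<TAB>1" line per token read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def run_reducer():
    # Hadoop Streaming sorts mapper output by key, so counts for the same
    # word arrive consecutively and can be summed in a single pass.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if current_word is not None and word != current_word:
            print(f"{current_word}\t{current_count}")
            current_count = 0
        current_word = word
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

# Illustrative invocation (jar and HDFS paths are placeholders):
#   hadoop jar hadoop-streaming.jar -input /logs -output /counts \
#       -mapper "python3 job.py map" -reducer "python3 job.py reduce"
if __name__ == "__main__":
    run_mapper() if sys.argv[1:2] == ["map"] else run_reducer()
```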

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop of sequencing cost are transforming bio-science and bio-medicine into data-driven sciences. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data centers, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of searching entries, Internet forum posts, chatting records, and microblog messages is generated. These data are closely related to people's daily life and share the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which is historically static data managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. In smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close-range transmission may rely on sensor networks and remote transmission depends on the Internet. Finally, the application layer supports the specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment is distributedly deployed, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic (a minimal filtering sketch is given after this list).
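To illustrate the last point, the sketch below keeps only readings that deviate strongly from a per-sensor running mean, mimicking the idea of discarding "normal flow" samples and retaining the few abnormal events that carry most of the value. The record fields and the threshold are illustrative assumptions rather than part of any IoT standard.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    sensor_id: str
    timestamp: float   # time stamp attached by the acquisition device
    location: tuple    # (latitude, longitude) of the device
    value: float

def effective_readings(readings, threshold=3.0):
    """Keep only readings that deviate strongly from the per-sensor running mean."""
    stats = {}   # sensor_id -> (count, running mean)
    kept = []
    for r in readings:
        count, mean = stats.get(r.sensor_id, (0, 0.0))
        if count > 0 and abs(r.value - mean) > threshold:
            kept.append(r)                  # abnormal reading: the "effective" data
        count += 1
        mean += (r.value - mean) / count    # update the running mean
        stats[r.sensor_id] = (count, mean)
    return kept
```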

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies have been innovatively developed since the beginning of the 21st century, frontier research in the bio-medicine field has also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in this field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet". IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167 TB to 665 TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night surpassed 20 TB. The last application is related to high-energy physics. In the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generated raw data at 2 PB/s and stored about 10 TB of processed data per year.

In addition, pervasive sensing and computing, among nature, commercial, Internet, government, and social environments, are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: Sensors are common in daily life to measure physical quantities and transform them into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled at different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of a web crawler, a word segmentation system, a task system, and an index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. A web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded web page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped (a minimal crawler sketch is given at the end of this subsection). Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used (a minimal capture sketch is given after this list).


– Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at the external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and more numerous means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy": it may collect wireless data and geographical location information and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Google's Android and Microsoft's Windows Phone can also collect information in a similar manner.
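As a concrete illustration of packet capture, the sketch below uses the scapy library (an assumption on our part; the text describes Libpcap directly, while scapy relies on libpcap or an equivalent capture backend on most platforms) to grab a handful of packets. Capturing raw traffic normally requires administrator/root privileges.

```python
# A minimal packet-capture sketch; interface selection and filtering are omitted.
from scapy.all import sniff

def handle(pkt):
    # In a real collector this record would be appended to a log file or queue.
    print(pkt.summary())

# Capture 10 packets from the default interface and pass each one to handle().
sniff(count=10, prn=handle)
```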

In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources, and collection methods recording through other auxiliary tools.
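The sketch below illustrates the URL-queue process described for web crawlers above: a breadth-first loop that downloads a page, extracts its links, and appends unseen URLs to the queue. The seed URL is a placeholder, the link extraction is deliberately naive, and a production crawler would also honor robots.txt, rate limits, and politeness policies.

```python
# A minimal breadth-first crawler sketch built around a URL queue.
import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed_url, max_pages=50):
    queue = deque([seed_url])   # URLs waiting to be fetched, in order of precedence
    seen = {seed_url}           # URLs already enqueued, to avoid revisiting
    pages = {}                  # url -> downloaded HTML, kept for indexing
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue            # skip unreachable pages
        pages[url] = html
        # Extract new URLs from href attributes and append them to the queue.
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Example (placeholder seed): pages = crawl("https://example.org")
```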

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, and are generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architectures, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TOR), and then such top-of-rack switches are connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such a layer is constituted by 10 Gbps or 100 Gbps core switches to connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58] (a capacity sketch for one common fat-tree construction is given after this list). Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches, using the low-cost multi-mode fiber (MMF) with 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
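For the fat-tree structures mentioned above, one common concrete instance is the k-ary fat-tree built entirely from k-port switches; the sketch below computes its host and switch counts. The specific k-ary construction is our assumption for illustration, since the text does not fix one.

```python
def fat_tree_capacity(k):
    """Capacity of a canonical k-ary fat-tree built from k-port switches.

    Three layers (edge, aggregation, core); each of the k pods has k/2 edge
    and k/2 aggregation switches, and each edge switch connects k/2 hosts.
    """
    assert k % 2 == 0, "k must be even"
    hosts = k ** 3 // 4
    edge = aggregation = k * k // 2
    core = (k // 2) ** 2
    return {"hosts": hosts, "edge": edge, "aggregation": aggregation, "core": core}

# e.g. with 48-port switches: fat_tree_capacity(48)["hosts"] == 27648
```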

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have serious requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; on the contrary, it includes information or metadata related to the actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistency, and is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data, so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, leading to data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well-known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

For generalized data transmission or storage, repeated data deletion is a special data compression technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block. Repeated data deletion can greatly reduce the storage requirement, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
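A minimal sketch of the hash-based repeated-data deletion just described is given below: each fixed-size block is identified by its hash, stored once, and later occurrences are replaced by the identifier. Fixed-size chunking and SHA-256 are illustrative choices; production systems typically use content-defined chunking and keep the identifier index on disk.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4096):
    index = {}    # identifier (hash) -> stored chunk, each unique chunk kept once
    layout = []   # sequence of identifiers that reconstructs the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        identifier = hashlib.sha256(chunk).hexdigest()
        if identifier not in index:   # new block: store it
            index[identifier] = chunk
        layout.append(identifier)     # repeated block: only the identifier is kept
    return index, layout

def restore(index, layout) -> bytes:
    return b"".join(index[identifier] for identifier in layout)

# Example: a highly redundant byte string is stored as a handful of unique chunks.
# index, layout = deduplicate(b"abcd" * 10000)
# assert restore(index, layout) == b"abcd" * 10000
```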

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data accessing. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue big storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems emerge to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable to interconnect servers at a small scale. Moreover, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually an auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented SAN is especiallydesigned for data storage with a scalable and bandwidthintensive network eg a high-speed network with opticalfiber connections In SAN data storage management is rel-atively independent within a storage local area networkwhere multipath based data switching among any internalnodes is utilized to achieve a maximum degree of datasharing and data management

From the organization of a data storage system DASNAS and SAN can all be divided into three parts (i) discarray it is the foundation of a storage system and the fun-damental guarantee for data storage (ii) connection andnetwork sub-systems which provide connection among oneor more disc arrays and servers (iii) storage managementsoftware which handles data sharing disaster recovery andother storage management tasks of multiple servers

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed

system to store massive data, the following factors should be taken into consideration:

ndash Consistency a distributed storage system requires mul-tiple servers to cooperatively store data As there aremore servers the probability of server failures will belarger Usually data is divided into multiple pieces tobe stored at different servers to ensure availability incase of server failure However server failures and par-allel storage may cause inconsistency among differentcopies of the same data Consistency refers to assuringthat multiple copies of the same data are identical

ndash Availability a distributed storage system operates inmultiple sets of servers As more servers are usedserver failures are inevitable It would be desirable ifthe entire system is not seriously affected to satisfy cus-tomerrsquos requests in terms of reading and writing Thisproperty is called availability

ndash Partition Tolerance multiple servers in a distributedstorage system are connected by a network The net-work could have linknode failures or temporary con-gestion The distributed system should have a certainlevel of tolerance to problems caused by network fail-ures It would be desirable that the distributed storagestill works well when the network is partitioned

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance ie theycould not handle network failures Therefore CA sys-tems are generally deemed as storage systems with a sin-gle server such as the traditional small-scale relationaldatabases Such systems feature single copy of data suchthat consistency is easily ensured Availability is guaranteedby the excellent design of relational databases Howeversince CA systems could not handle network failures theycould not be expanded to use many servers Therefore mostlarge-scale storage systems are CP systems and AP systems

Compared with CA systems CP systems ensure parti-tion tolerance Therefore CP systems can be expanded tobecome distributed systems CP systems generally main-tain several copies of the same data in order to ensure a


level of fault tolerance CP systems also ensure data consis-tency ie multiple copies of the same data are guaranteedto be completely identical However CP could not ensuresound availability because of the high cost for consistencyassurance Therefore CP systems are useful for the scenar-ios with moderate load but stringent requirements on dataaccuracy (eg trading data) BigTable and Hbase are twopopular CP systems

AP systems also ensure partition tolerance However APsystems are different from CP systems in that AP systemsalso ensure availability However AP systems only ensureeventual consistency rather than strong consistency in theprevious two systems Therefore AP systems only applyto the scenarios with frequent requests but not very highrequirements on accuracy For example in online SocialNetworking Services (SNS) systems there are many con-current visits to the data but a certain amount of data errorsare tolerable Furthermore because AP systems ensureeventual consistency accurate data can still be obtained aftera certain amount of delay Therefore AP systems may alsobe used under the circumstances with no stringent realtimerequirements Dynamo and Cassandra are two popular APsystems

4.3 Storage mechanism for big data

Considerable research on big data promotes the develop-ment of storage mechanisms for big data Existing stor-age mechanisms of big data may be classified into threebottom-up levels (i) file systems (ii) databases and (iii)programming models

File systems are the foundation of the applications atupper levels Googlersquos GFS is an expandable distributedfile system to support large-scale distributed data-intensiveapplications [25] GFS uses cheap commodity servers toachieve fault-tolerance and provides customers with high-performance services GFS supports large-scale file appli-cations with more frequent reading than writing HoweverGFS also has some limitations such as a single point offailure and poor performances for small files Such limita-tions have been overcome by Colossus [82] the successorof GFS

In addition other companies and researchers also havetheir solutions to meet the different demands for storageof big data For example HDFS and Kosmosfs are deriva-tives of open source codes of GFS Microsoft developedCosmos [83] to support its search and advertisement busi-ness Facebook utilizes Haystack [84] to store the largeamount of small-sized photos Taobao also developed TFSand FastDFS In conclusion distributed file systems havebeen relatively mature after years of development and busi-ness operation Therefore we will focus on the other twolevels in the rest of this section

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

ndash Key-value Databases Key-value Databases are con-stituted by a simple data model and data is storedcorresponding to key-values Every key is unique andcustomers may input queried values according to thekeys Such databases feature a simple structure andthe modern key-value databases are characterized withhigh expandability and shorter query response time thanthose of relational databases Over the past few yearsmany key-value databases have appeared as motivatedby Amazonrsquos Dynamo system [85] We will introduceDynamo and several other representative key-valuedatabases

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services of the Amazon e-Commerce Platform, which can be realized with key access. The rigid schema of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partitioning, data replication, and object versioning mechanisms. The Dynamo partitioning plan relies on Consistent Hashing [86], whose main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes, while dividing the load among the main storage machines (a minimal consistent-hashing sketch is given at the end of this overview of NoSQL databases). Dynamo replicates data to N sets of servers, in which N is a configurable parameter, in order to achieve


high availability and durability. The Dynamo system also provides eventual consistency so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations, reading, writing, and deletion, all of which are identified by keys. Voldemort provides asynchronous updating and concurrent control of multiple versions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the update and any other operation, the update operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine; in particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks; other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.

ndash Column-oriented Database The column-orienteddatabases store and process data according to columnsother than rows Both columns and rows are segmentedin multiple nodes to realize expandability The column-oriented databases are mainly inspired by GooglersquosBigTable In this Section we first discuss BigTable andthen introduce several derivative tools

ndash BigTable BigTable is a distributed structureddata storage system which is designed to pro-cess the large-scale (PB class) data amongthousands commercial servers [87] The basicdata structure of Bigtable is a multi-dimensionsequenced mapping with sparse distributedand persistent storage Indexes of mappingare row key column key and timestampsand every value in mapping is an unana-lyzed byte array Each row key in BigTable

is a character string of up to 64 KB. Rows are stored in lexicographical order and are continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. Columns are grouped according to the prefixes of their keys, thus forming column families, which are the basic units for access control. The timestamps are 64-bit integers that distinguish different versions of a cell value. Clients may flexibly determine the number of cell versions stored. These versions are ordered by descending timestamp, so the latest version is always read first.

The BigTable API features the creation anddeletion of Tablets and column families as wellas modification of metadata of clusters tablesand column families Client applications mayinsert or delete values of BigTable query val-ues from columns or browse sub-datasets in atable Bigtable also supports some other char-acteristics such as transaction processing in asingle row Users may utilize such features toconduct more complex data processing

Every BigTable deployment includes three main components: a Master server, Tablet servers, and a client library. BigTable allows only one active Master server, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage in GFS, such as deleted or disabled files used by specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of its loaded Tablets. When Tablets become too big, they will be split by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamentalcomponents of Google including GFS [25]cluster management system SSTable file for-mat and Chubby [88] GFS is use to store dataand log files The cluster management systemis responsible for task scheduling resourcessharing processing of machine failures andmonitoring of machine statuses SSTable fileformat is used to store BigTable data internally


and it provides mapping between persistentsequenced and unchangeable keys and valuesas any byte strings BigTable utilizes Chubbyfor the following tasks in server 1) ensurethere is at most one active Master copy atany time 2) store the bootstrap location ofBigTable data 3) look up Tablet server 4) con-duct error recovery in case of Table server fail-ures 5) store BigTable schema information 6)store the access control table

– Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon's Dynamo and Google's BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

ndash Derivative tools of BigTable since theBigTable code cannot be obtained throughthe open source license some open sourceprojects compete to implement the BigTableconcept to develop similar systems such asHBase and Hypertable

HBase is a BigTable cloned version pro-grammed with Java and is a part of Hadoop ofApachersquos MapReduce framework [90] HBasereplaces GFS with HDFS It writes updatedcontents into RAM and regularly writes theminto files on disks The row operations areatomic operations equipped with row-levellocking and transaction processing which is

optional for large scale Partition and distribu-tion are transparently operated and have spacefor client hash or fixed key

HyperTable was developed similar toBigTable to obtain a set of high-performanceexpandable distributed storage and process-ing systems for structured and unstructureddata [91] HyperTable relies on distributedfile systems eg HDFS and distributed lockmanager Data representation processing andpartition mechanism are similar to that inBigTable HyperTable has its own query lan-guage called HyperTable query language(HQL) and allows users to create modify andquery underlying tables

Since the column-oriented storage databases mainlyemulate BigTable their designs are all similar exceptfor the concurrency mechanism and several other fea-tures For example Cassandra emphasizes weak consis-tency of concurrent control of multiple editions whileHBase and HyperTable focus on strong consistencythrough locks or log records

– Document Databases: Compared with key-value stores, document stores can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a JSON-like syntax; a database driver sends the query as a BSON object to MongoDB. The system allows queries over all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is executed with log files on the master node that record all the high-level operations conducted in the database. During replication, the slaves query all the writing operations since their last synchronization with the master and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes while automatically balancing load and handling failover.


– SimpleDB: SimpleDB is a distributed database provided as a web service by Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB assures eventual consistency but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may execute along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on its replication mechanism. CouchDB supports MVCC with historical hash records.
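
As referenced in the Dynamo discussion above, the following Python sketch illustrates consistent hashing with virtual nodes, the partitioning idea that Dynamo-style key-value stores rely on. It is not Dynamo's actual implementation; the hash function, the number of virtual nodes, and the demo key set are assumptions made for the example.

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Toy consistent-hashing ring: keys and nodes are hashed onto the same ring, and a
        key is assigned to the first node found clockwise from the key's position. Adding or
        removing a node only remaps the keys that fall in the affected arc."""

        def __init__(self, nodes=(), vnodes=100):
            self.vnodes = vnodes          # virtual nodes per physical node, to smooth the load
            self._ring = []               # sorted list of (position, node)
            for node in nodes:
                self.add_node(node)

        @staticmethod
        def _hash(value: str) -> int:
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def add_node(self, node: str):
            for i in range(self.vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

        def remove_node(self, node: str):
            self._ring = [(pos, n) for pos, n in self._ring if n != node]

        def get_node(self, key: str) -> str:
            pos = self._hash(key)
            idx = bisect.bisect(self._ring, (pos, "")) % len(self._ring)   # wrap around the ring
            return self._ring[idx][1]

    if __name__ == "__main__":
        ring = ConsistentHashRing(["serverA", "serverB", "serverC"])
        before = {k: ring.get_node(k) for k in (f"key{i}" for i in range(1000))}
        ring.add_node("serverD")                      # only a fraction of the keys should move
        moved = sum(1 for k in before if before[k] != ring.get_node(k))
        print(f"keys remapped after adding a node: {moved} / 1000")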

Big data is generally stored in hundreds or even thousands of commodity servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL databases and reduce the performance gap to relational databases; therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses a large number of clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set (a minimal word-count sketch is given after this list of programming models). MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in the logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by acentral program called job manager which can be exe-cuted in clusters or workstations through network Ajob manager consists of two parts 1) application codeswhich are used to build a job communication graphand 2) program library codes that are used to arrangeavailable resources All kinds of data are directly trans-mitted between vertexes Therefore the job manager isonly responsible for decision-making which does notobstruct any data transmission

In Dryad application developers can flexibly chooseany directed acyclic graph to describe the communica-tion modes of the application and express data transmis-sion mechanisms In addition Dryad allows vertexesto use any amount of input and output data whileMapReduce supports only one input and output set


DryadLINQ [102] is the advanced language of Dryadand is used to integrate the aforementioned SQL-likelanguage execution environment

ndash All-Pairs All-Pairs [103] is a system specially designedfor biometrics bio-informatics and data mining appli-cations It focuses on comparing element pairs in twodatasets by a given function All-Pairs can be expressedas three-tuples (Set A Set B and Function F) in whichFunction F is utilized to compare all elements in Set Aand Set B The comparison result is an output matrix Mwhich is also called the Cartesian product or cross joinof Set A and Set B

All-Pairs is implemented in four phases systemmodeling distribution of input data batch job man-agement and result collection In Phase I an approx-imation model of system performance will be built toevaluate how much CPU resource is needed and how toconduct job partition In Phase II a spanning tree is builtfor data transmissions which makes the workload ofevery partition retrieve input data effectively In PhaseIII after the data flow is delivered to proper nodes theAll-Pairs engine will build a batch-processing submis-sion for jobs in partitions while sequencing them in thebatch processing system and formulating a node run-ning command to acquire data In the last phase afterthe job completion of the batch processing system theextraction engine will collect results and combine themin a proper structure which is generally a single file listin which all results are put in order

– Pregel: The Pregel [104] system of Google facilitates the processing of large-scale graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge leaving a source vertex is constituted by a user-defined value and the identifier of the target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in an inactive status without any message to transmit, the entire program execution is completed.

The Pregel program output is a set consisting of the val-ues output from all the vertexes Generally speakingthe input and output of Pregel program are isomorphicdirected graphs
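
As referenced in the MapReduce item above, the following single-process Python sketch mimics only the programming contract of MapReduce (user-defined Map and Reduce functions plus a shuffle that groups intermediate values by key) on a word-count task; it includes none of the distribution, scheduling, or fault tolerance of a real MapReduce runtime.

    from collections import defaultdict
    from itertools import chain

    # User-defined Map function: input record -> list of intermediate (key, value) pairs.
    def map_fn(_, line):
        return [(word.lower(), 1) for word in line.split()]

    # User-defined Reduce function: (key, all values for that key) -> output pairs.
    def reduce_fn(word, counts):
        return [(word, sum(counts))]

    def run_mapreduce(records, map_fn, reduce_fn):
        # Map phase
        intermediate = chain.from_iterable(map_fn(k, v) for k, v in records)
        # Shuffle phase: group intermediate values by key
        groups = defaultdict(list)
        for key, value in intermediate:
            groups[key].append(value)
        # Reduce phase
        return list(chain.from_iterable(reduce_fn(k, vs) for k, vs in sorted(groups.items())))

    if __name__ == "__main__":
        docs = enumerate(["big data needs big storage", "storage and analysis of big data"])
        print(run_mapreduce(docs, map_fn, reduce_fn))
        # [('analysis', 1), ('and', 1), ('big', 3), ('data', 2), ('needs', 1), ('of', 1), ('storage', 2)]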

Inspired by the above programming models otherresearches have also focused on programming modes formore complex computational tasks eg iterative computa-tions [105 106] fault-tolerant memory computations [107]incremental computations [108] and flow control decision-making related to data [109]

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of the data. Data analysis plays a large guidance role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data; therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

ndash Cluster Analysis is a statistical method for groupingobjects and specifically classifying objects accordingto some features Cluster analysis is used to differenti-ate objects with particular features and divide them intosome categories (clusters) according to these featuressuch that objects in the same category will have highhomogeneity while different categories will have highheterogeneity Cluster analysis is an unsupervised studymethod without training data

ndash Factor Analysis is basically targeted at describing therelation among many elements with only a few factorsie grouping several closely related variables into a fac-tor and the few factors are then used to reveal the mostinformation of the original data



ndash Correlation Analysis is an analytical method for deter-mining the law of relations such as correlation cor-relative dependence and mutual restriction amongobserved phenomena and accordingly conducting fore-cast and control Such relations may be classified intotwo types (i) function reflecting the strict dependencerelationship among phenomena which is also calleda definitive dependence relationship (ii) correlationsome undetermined or inexact dependence relationsand the numerical value of a variable may correspondto several numerical values of the other variable andsuch numerical values present a regular fluctuationsurrounding their mean values

ndash Regression Analysis is a mathematical tool for reveal-ing correlations between one variable and several othervariables Based on a group of experiments or observeddata regression analysis identifies dependence relation-ships among variables hidden by randomness Regres-sion analysis may make complex and undeterminedcorrelations among variables to be simple and regular

– A/B Testing: also called bucket testing, it is a technique for determining how to improve target variables by comparing a tested group against a control. Big data will require a large number of such tests to be executed and analyzed.

ndash Statistical Analysis Statistical analysis is based onthe statistical theory a branch of applied mathemat-ics In statistical theory randomness and uncertaintyare modeled with Probability Theory Statistical anal-ysis can provide a description and an inference forbig data Descriptive statistical analysis can summarizeand describe datasets while inferential statistical anal-ysis can draw conclusions from data subject to randomvariations Statistical analysis is widely applied in theeconomic and medical care fields [110]

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research (a small k-means illustration is given after this list).
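
As a small illustration of one of the algorithms named in the list above, the following Python sketch clusters synthetic two-dimensional points with k-means using the scikit-learn library; the data, the choice of k = 3, and the library itself are assumptions made for the example rather than part of the ICDM selection.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Synthetic data: three Gaussian blobs in the plane
    points = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
    ])

    # Group the points into k = 3 clusters (unsupervised: no training labels are used)
    model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    print("cluster centers:\n", model.cluster_centers_)
    print("first ten assignments:", model.labels_[:10])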

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy-compression storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages such as misrecognition (false positives) and the lack of support for deletion (a small sketch is given after this list).

ndash Hashing it is a method that essentially transforms datainto shorter fixed-length numerical values or index val-ues Hashing has such advantages as rapid readingwriting and high query speed but it is hard to find asound Hash function

ndash Index index is always an effective method to reducethe expense of disk reading and writing and improveinsertion deletion modification and query speeds inboth traditional relational databases that manage struc-tured data and other technologies that manage semi-structured and unstructured data However index hasa disadvantage that it has the additional cost for stor-ing index files which should be maintained dynamicallywhen data is updated

ndash Triel also called trie tree a variant of Hash Tree It ismainly applied to rapid retrieval and word frequencystatistics The main idea of Triel is to utilize commonprefixes of character strings to reduce comparison oncharacter strings to the greatest extent so as to improvequery efficiency

ndash Parallel Computing compared to traditional serial com-puting parallel computing refers to simultaneouslyutilizing several computing resources to complete acomputation task Its basic idea is to decompose aproblem and assign them to several separate processesto be independently completed so as to achieve co-processing Presently some classic parallel comput-ing models include MPI (Message Passing Interface)MapReduce and Dryad (See a comparison in Table 1)
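
As referenced in the Bloom Filter item above, the following is a minimal Python sketch of a bit-array Bloom filter. The sizing and the way multiple hash functions are derived (salted MD5 digests) are illustrative assumptions, not a production design.

    import hashlib

    class BloomFilter:
        """Space-efficient membership test: only hash-derived bits are stored, so lookups may
        return false positives but never false negatives, and items cannot be deleted."""

        def __init__(self, num_bits=8192, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item: str):
            # Derive several hash functions by salting one digest with different seeds
            for seed in range(self.num_hashes):
                digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    if __name__ == "__main__":
        bf = BloomFilter()
        for url in ("http://a.example", "http://b.example"):
            bf.add(url)
        print("http://a.example" in bf)    # True
        print("http://zzz.example" in bf)  # very likely False (small false-positive probability)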

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data different analyticalarchitectures shall be considered for different applicationrequirements


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment:
  MPI: computing nodes and data storage arranged separately (data has to be moved to the computing nodes)
  MapReduce: computing and data storage arranged at the same node (computing is placed close to the data)
  Dryad: computing and data storage arranged at the same node (computing is placed close to the data)
Resource management / scheduling:
  MPI: not provided
  MapReduce: Workqueue (Google), HOD (Yahoo)
  Dryad: not clear
Low-level programming:
  MPI: MPI API
  MapReduce: MapReduce API
  Dryad: Dryad API
High-level programming:
  MPI: not provided
  MapReduce: Pig, Hive, Jaql, ...
  Dryad: Scope, DryadLINQ
Data storage:
  MPI: the local file system, NFS, ...
  MapReduce: GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ...
  Dryad: NTFS, Cosmos DFS
Task partitioning:
  MPI: users manually partition the tasks
  MapReduce: automatic
  Dryad: automatic
Communication:
  MPI: messaging, remote memory access
  MapReduce: files (local FS, DFS)
  Dryad: files, TCP pipes, shared-memory FIFOs
Fault tolerance:
  MPI: checkpoint
  MapReduce: task re-execution
  Dryad: task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements big data analysis canbe classified into real-time analysis and off-line analysis

ndash Real-time analysis is mainly used in E-commerce andfinance Since data constantly changes rapid data anal-ysis is needed and analytical results shall be returnedwith a very short delay The main existing architec-tures of real-time analysis include (i) parallel process-ing clusters using traditional relational databases and(ii) memory-based computing platforms For exampleGreenplum from EMC and HANA from SAP are bothreal-time analysis architectures

ndash Offline analysis is usually used for applications with-out high requirements on response time eg machinelearning statistical analysis and recommendation algo-rithms Offline analysis generally conducts analysis byimporting logs into a special platform through dataacquisition tools Under the big data setting manyInternet enterprises utilize the offline analysis archi-tecture based on Hadoop in order to reduce the costof data format conversion and improve the efficiencyof data acquisition Examples include Facebookrsquos opensource tool Scribe LinkedInrsquos open source tool KafkaTaobaorsquos open source tool Timetunnel and Chukwa ofHadoop etc These tools can meet the demands of dataacquisition and transmission with hundreds of MB persecond

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory levelanalysis Business Intelligence (BI) level analysis and mas-sive level analysis which are examined in the following

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved, and it has been widely applied.

ndash BI analysis is for the case when the data scale sur-passes the memory level but may be imported intothe BI analysis environment The currently mainstreamBI products are provided with data analysis plans tosupport the level over TB

ndash Massive analysis is for the case when the data scalehas completely surpassed the capacities of BI productsand traditional relational databases At present mostmassive analysis utilize HDFS of Hadoop to store dataand use MapReduce for data analysis Most massiveanalysis belongs to the offline analysis category


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithmsdiffer greatly from each other according to different kindsof data and application demands For example for applica-tions that are amenable to parallel processing a distributedalgorithm may be designed and a parallel processing modelmay be used for data analysis

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are avail-able including professional and amateur software expen-sive commercial software and open source software In thissection we briefly review the top five most widely usedsoftware according to a survey of ldquoWhat Analytics Datamining Big Data software that you used in the past 12months for a real projectrdquo of 798 professionals made byKDNuggets in 2012 [112]

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment; in addition, skilled users can directly call R objects from C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked top 1 in the KDNuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (which ranked top 1 in 2012). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and dis-played through a graphic user interface (GUI) Rapid-Miner is written in Java It integrates the learner andevaluation method of Weka and works with R Func-tions of Rapidminer are implemented with connectionof processes including various operators The entireflow can be deemed as a production line of a factorywith original data input and model results output Theoperators can be considered as some specific functionswith different input and output characteristics

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to expand KNIME; developers can effortlessly add various nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section we examined big data analysiswhich is the final and most important phase of the valuechain of big data Big data analysis can provide usefulvalues via judgments suggestions supports or decisions


However data analysis involves a wide range of applica-tions which frequently change and are extremely complexIn this section we first review the evolution of data sourcesWe then examine six of the most important data analysisfields including structured data analysis text analysis web-site analysis multimedia analysis network analysis andmobile analysis Finally we introduce several key applica-tion fields of big data

6.1 Application evolutions

Recently big data analysis has been proposed as anadvanced analytical technology which typically includeslarge-scale and complex programs under specific analyticalmethods As a matter of fact data driven applications haveemerged in the past decades For example as early as 1990sBI has become a prevailing technology for business appli-cations and network search engines based on massive datamining processing emerged in the early 21st century Somepotential and influential applications from different fieldsand their data and analysis characteristics are discussed asfollows

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems, prevailing in the 1990s, were intuitive and simple, e.g., in the forms of reports, dashboards, conditional queries, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data, logs, and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive content, etc. Therefore, a plentiful set of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

ndash Evolution of Scientific Applications Scientific researchin many fields is acquiring massive data with high-throughput sensors and instruments such as astro-physics oceanology genomics and environmentalresearch The US National Science Foundation (NSF)has recently announced the BIGDATA program to pro-mote efforts to extract knowledge and insights fromlarge and complex collections of digital data Somescientific research disciplines have developed big dataplatforms and obtained useful outcomes For examplein biology iPlant [116] applies network infrastructurephysical computing resources coordination environ-ment virtual machine resources inter-operative anal-ysis software and data service to assist researcherseducators and students in enriching plant sciences TheiPlant datasets have high varieties in form includingspecification or reference data experimental data ana-log or model data observation data and other deriveddata

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.


6.2.1 Structured data analysis

Business applications and scientific research may generatemassive structured data of which the management and anal-ysis rely on mature commercialized technologies such asRDBMS data warehouse OLAP and BPM (Business Pro-cess Management) [28] Data analysis is mainly based ondata mining and statistical analysis both of which have beenwell studied over the past 30 years

However data analysis is still a very active research fieldand new application demands drive the development of newmethods For example statistical machine learning based onexact mathematical models and powerful algorithms havebeen applied to anomaly detection [117] and energy con-trol [118] Exploiting data characteristics time and spacemining can extract knowledge structures hidden in high-speed data flows and sensors [119] Driven by privacyprotection in e-commerce e-government and health careapplications privacy protection data mining is an emergingresearch field [120] Over the past decade process mining isbecoming a new research field especially in process analysiswith event data [121]

6.2.2 Text data analysis

The most common format of information storage is texteg emails business documents web pages and socialmedia Therefore text analysis is deemed to feature morebusiness-based potential than structured data Generallytext analysis is a process to extract useful informationand knowledge from unstructured text Text mining isinter-disciplinary involving information retrieval machinelearning statistics computing linguistics and data min-ing in particular Most text mining systems are based ontext expressions and natural language processing (NLP)with more emphasis on the latter NLP allows comput-ers to analyze interpret and even generate text Somecommon NLP methods include lexical acquisition wordsense disambiguation part-of-speech tagging and prob-abilistic context free grammar [122] Some NLP-basedtechniques have been applied to text mining includinginformation extraction topic models text summarizationclassification clustering question answering and opinionmining
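
To make one of the NLP building blocks mentioned above concrete, the following Python sketch performs tokenization and part-of-speech tagging with the NLTK library. NLTK is only one possible toolkit (the survey does not prescribe it), the required model downloads are shown explicitly, and the resource names may vary slightly across NLTK versions.

    import nltk

    # One-time downloads of the tokenizer and tagger models (names assume a recent NLTK release)
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    text = "Text mining extracts useful knowledge from unstructured business documents."
    tokens = nltk.word_tokenize(text)   # lexical analysis: split the text into word tokens
    tagged = nltk.pos_tag(tokens)       # part-of-speech tagging, one NLP step named above
    print(tagged)
    # e.g. [('Text', 'NN'), ('mining', 'NN'), ('extracts', 'VBZ'), ...]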

6.2.3 Web data analysis

Web data analysis has emerged as an active research fieldIt aims to automatically retrieve extract and evaluate infor-mation from Web documents and services so as to dis-cover useful knowledge Web analysis is related to severalresearch fields including database information retrievalNLP and text mining According to the different parts be

mined we classify Web data analysis into three relatedfields Web content mining Web structure mining and Webusage mining [123]

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of the semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing such models [127].
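
As a small illustration of the link-structure models mentioned above, the following Python sketch builds a toy directed hyperlink graph and computes PageRank scores with the NetworkX library; the graph and the library choice are assumptions made for the example.

    import networkx as nx

    # Toy hyperlink graph: an edge (u, v) means page u links to page v
    web = nx.DiGraph([
        ("home", "products"), ("home", "blog"),
        ("products", "home"), ("blog", "products"),
        ("blog", "home"), ("about", "home"),
    ])

    scores = nx.pagerank(web, alpha=0.85)   # damping factor 0.85, the usual default
    for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{page:10s} {score:.3f}")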

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; multimedia analysis aims to extract useful knowledge from it and to understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extractingthe prominent words or phrases from metadata or syn-thesizing a new representation Video summarization is tointerpret the most important or representative video con-tent sequence and it can be static or dynamic Static videosummarization methods utilize a key frame sequence orcontext-sensitive key frames to represent a video Suchmethods are simple and have been applied to many businessapplications (eg by Yahoo AltaVista and Google) buttheir performance is poor Dynamic summarization meth-ods use a series of video frame to represent a video andtake other smooth measures to make the final summariza-tion look more natural In [128] the authors propose atopic-oriented multimedia summarization system (TOMS)that can automatically summarize the important informationin a video belonging to a certain topic area based on a givenset of extracted features from the video

Multimedia annotation inserts labels to describe con-tents of images and videos at both syntax and semanticlevels With such labels the management summarizationand retrieval of multimedia data can be easily implementedSince manual annotation is both time and labor inten-sive automatic annotation without any human interventionsbecomes highly appealing The main challenge for auto-matic multimedia annotation is the semantic differenceAlthough much progress has been made the performanceof existing automatic annotation methods still needs to beimproved Currently many efforts are being made to syn-chronously explore both manual and automatic multimediaannotation [129]

Multimedia indexing and retrieval involve describingstoring and organizing multimedia information and assist-ing users to conveniently and quickly look up multime-dia resources [130] Generally multimedia indexing andretrieval include five procedures structural analysis featureextraction data mining classification and annotation queryand retrieval [131] Structural analysis aims to segment avideo into several semantic structural elements includinglens boundary detection key frame extraction and scene

segmentation etc According to the result of structuralanalysis the second procedure is feature extraction whichmainly includes further mining the features of key framesobjects texts and movements which are the foundation ofvideo indexing and retrieval Data mining classificationand annotation are to utilize the extracted features to findthe modes of video contents and put videos into scheduledcategories so as to generate video indexes Upon receiving aquery the system will use a similarity measurement methodto look up a candidate video The retrieval result optimizesthe related feedback

Multimedia recommendation aims to recommend specific multimedia content according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of the content they are interested in, and recommend other content with similar features to the user. These methods largely rely on content similarity measurement, but most of them are troubled by limited analysis capability and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend content to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
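
To illustrate the collaborative-filtering idea described above, the following Python sketch performs user-based recommendation with cosine similarity over a tiny rating matrix using NumPy; the ratings and the similarity-weighted prediction rule are illustrative assumptions, not a production recommender.

    import numpy as np

    # Rows = users, columns = items; 0 means "not rated yet"
    ratings = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
        [0, 1, 4, 5],
    ], dtype=float)

    def cosine_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    def recommend(user: int, top_n: int = 1):
        # Similarity between the target user and every other user
        sims = np.array([cosine_sim(ratings[user], ratings[v]) for v in range(len(ratings))])
        sims[user] = 0.0
        # Predicted score per item: similarity-weighted average of the other users' ratings
        preds = sims @ ratings / (sims.sum() + 1e-9)
        preds[ratings[user] > 0] = -np.inf          # do not re-recommend items already rated
        return np.argsort(preds)[::-1][:top_n]

    print("items recommended to user 0:", recommend(0))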

The US National Institute of Standards and Technol-ogy (NIST) initiated the TREC Video Retrieval Evaluationfor detecting the occurrence of an event in video-clipsbased on Event Kit which contains some text descriptionrelated to concepts and video examples [134] In [135]the author proposed a new algorithm on special multimediaevent detection using a few positive training examples Theresearch on video event detection is still in its infancy andmainly focuses on sports or news events running or abnor-mal events in monitoring videos and other similar eventswith repetitive patterns

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally contain massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graph. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in an SNS [140]. Linear algebra methods compute the similarity between two vertexes from a rank-reduced similarity matrix, e.g., obtained through singular value decomposition [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
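A minimal sketch of a similarity-based link-prediction score is given below, using the Adamic-Adar index (the sum of inverse log-degrees of the common neighbors of a node pair) on a toy graph. This is one common heuristic for illustration, not the specific methods of [139-141].

```python
import math

def adamic_adar(adj, u, v):
    """Adamic-Adar link-prediction score for the pair (u, v): sum of
    1/log(degree) over their common neighbors in the adjacency dict."""
    common = adj[u] & adj[v]
    return sum(1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1)

# Tiny undirected graph given as an adjacency dict of neighbor sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
# Score every currently unlinked pair; a higher score means a more likely future link.
nodes = sorted(adj)
pairs = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:] if v not in adj[u]]
print(sorted(((adamic_adar(adj, u, v), u, v) for u, v in pairs), reverse=True))
```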

Many methods for community detection have been proposed and studied, most of which are based on topological objective functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and derive models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144-146], and some generative methods have been proposed to assist network and system design [147].
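Label propagation is one simple topology-based community detection heuristic: every node repeatedly adopts the label that is most common among its neighbors until labels stabilize. The sketch below is illustrative only and is not the large-scale method of [143].

```python
import random
from collections import Counter

def label_propagation(adj, max_iter=20, seed=42):
    """Detect communities by label propagation: every node repeatedly adopts
    the label most common among its neighbors until no label changes."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # start with one label per node
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            choices = [lab for lab, c in counts.items() if c == best]
            new = rng.choice(choices)
            if new != labels[v]:
                labels[v], changed = new, True
        if not changed:
            break
    return labels

# Two loosely connected triangles; nodes in each triangle should end up sharing a label.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(label_propagation(adj))
```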

Social influence refers to the case where individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertising, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, monthly mobile data traffic had reached 885 PB [151]. The massive data volume and abundant applications call for mobile data analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, movement flexibility, noise, and a large amount of redundancy. Recently, new research on mobile data analysis has been started in different fields. Since this research has just started, we only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the recent WeChat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) gather together on a network, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes a person's gait while walking and uses the gait information to unlock the security system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can become aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to grant loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises have gained profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support for decision making in managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills in a single year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.

Fig. 5 Enabling technologies for online social network-oriented big data
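Aspect 1), detecting sharp growth or drop in topic volume, can be approximated with a rolling z-score over daily counts, as in the following minimal sketch. The window length, threshold, and synthetic counts are assumptions for illustration, not the Global Pulse implementation.

```python
import numpy as np

def detect_spikes(counts, window=7, z_thresh=3.0):
    """Flag days whose topic count deviates from the trailing-window mean by
    more than `z_thresh` standard deviations (sharp growth or drop)."""
    counts = np.asarray(counts, dtype=float)
    alerts = []
    for t in range(window, len(counts)):
        hist = counts[t - window:t]
        mu, sigma = hist.mean(), hist.std()
        if sigma == 0:
            continue
        z = (counts[t] - mu) / sigma
        if abs(z) > z_thresh:
            alerts.append((t, float(z)))
    return alerts

# Synthetic daily counts of tweets mentioning a topic, with one burst at day 20.
daily = [100 + int(5 * np.sin(i)) for i in range(30)]
daily[20] = 400
print(detect_spikes(daily))
```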

Generally speaking, the application of big data from online SNS may help to better understand user behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final results into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. Crowd sensing can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had already been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or would not want to accomplish. With no need for intentionally deploying sensing modules or employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
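The operation framework above can be illustrated with a toy assignment step in which each spatial task is greedily matched to the nearest still-available worker. The task names, coordinates, and planar Euclidean distance are simplifying assumptions for the sketch.

```python
import math

def assign_tasks(tasks, workers):
    """Greedy spatial task assignment: each task (with a lat/lon location) is
    given to the nearest still-unassigned worker. Returns {task_id: worker_id}."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])   # adequate for small areas
    free = dict(workers)                               # worker_id -> location
    assignment = {}
    for task_id, loc in tasks.items():
        if not free:
            break
        best = min(free, key=lambda w: dist(loc, free[w]))
        assignment[task_id] = best
        del free[best]
    return assignment

# Hypothetical requests for photos at two locations and three nearby workers.
tasks = {"photo_plaza": (30.52, 114.36), "noise_station": (30.50, 114.30)}
workers = {"w1": (30.51, 114.35), "w2": (30.60, 114.40), "w3": (30.49, 114.31)}
print(assign_tasks(tasks, workers))
```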

6.3.6 Smart grid

Smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the smart grid, regions with excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probabilities can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at a given moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has successfully deployed smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month, as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy used such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a simple pricing sketch is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
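As referenced in the bullet on generation-consumption interaction, a time-of-use tariff can be sketched as follows. The peak window and prices are invented for illustration and do not describe TXU Energy's actual tariff.

```python
def electricity_cost(hourly_kwh, peak_hours=range(17, 22),
                     peak_price=0.25, offpeak_price=0.10):
    """Compute a daily bill under a simple time-of-use tariff: consumption
    during peak hours is billed at `peak_price` per kWh, the rest at
    `offpeak_price`. Prices and peak window are illustrative only."""
    total = 0.0
    for hour, kwh in enumerate(hourly_kwh):
        price = peak_price if hour in peak_hours else offpeak_price
        total += kwh * price
    return total

# 24 hourly smart-meter readings (kWh) for one household.
readings = [0.3] * 17 + [1.2] * 5 + [0.4] * 2
print(round(electricity_cost(readings), 2))
```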

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state of the art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research, because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many solutions for big data applications claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated once a system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions before and after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, how to effectively and rigorously evaluate data quality is also an urgent problem.

– Evolution of big data computing modes: These include the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data has triggered advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which are a bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets from different businesses can be re-organized so that more value can be mined; and (iii) data exhaust, which refers to incorrect data collected during acquisition; in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being devoted to big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and it is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate provenance information featuring different standards and coming from different datasets.

– Big data applications: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction for big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, possibly without the users being aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, acquired data from the public pages of Facebook users who had failed to modify their privacy settings, via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanisms: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is happening right now. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, a globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of sciences: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers a revolution in thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we are willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers that promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N (2011) The search for analysts to make sense of big data httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S (1998) Ip over sonet IEEE Commun Mag 36(5):136–142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations IEEE Commun Mag 48(7):32–39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB O'Reilly Media Inc

95 Anderson JC Lehnardt J Slater N (2010) CouchDB: the definitive guide O'Reilly Media Inc

96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in mapreduce In Proceedings of the 2010 ACM SIGMOD international conference on management of data ACM pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629
101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72
102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14
103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: IEEE international symposium on parallel and distributed processing (IPDPS 2008), IEEE, pp 1–11
104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 135–146
105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296
106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing, ACM, pp 810–818
107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, USENIX Association, pp 2–2
108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, ACM, p 7
109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9
110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York
111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html
113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer
114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT
115. Beyond the PC. Special report on personal technology (2011)
116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034
117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance, ACM, pp 70–77
118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers, ACM, pp 277–286
119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26
120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57
121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7
122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press
123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177
124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11
125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117
126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65
127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640
128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval, ACM, p 2
129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25
130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19
131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819
132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939
133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311
134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91
135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia, ACM, pp 469–478
136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569
137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company
138. Aggarwal CC (2011) An introduction to social network data analytics. Springer
139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1046–1054
140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10
142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on world wide web, ACM, pp 631–640
143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis, ACM, pp 16–25
144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ACM, pp 315–321
145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 145–158
146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 131–144
147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1007–1016
148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 807–816
149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining, ACM, pp 657–666
150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360
151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)
152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51
153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, p 2
154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval, ACM, pp 469–478
155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium, ACM, pp 445–454
156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, huge amounts of data in terms of search entries, Internet forum posts, chatting records, and microblog messages are generated. Those data are closely related to people's daily life and share the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are larger-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. This information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued the report "Analysis: the Applications of Big Data to the Real World," which indicates that the internal data of enterprises is the main source of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which is historically static data managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], and that the business turnover through the Internet, enterprise to enterprise and enterprise to consumer, will reach USD 450 billion per day [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close-range transmission may rely on sensor networks and remote transmission depends on the Internet. Finally, the application layer supports the specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment is distributedly deployed, which may acquire simple numeric data (e.g., location) or complex multimedia data (e.g., surveillance video). In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT is characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location, and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among the datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic. A minimal filtering sketch follows this list.
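To make the "effective data" point above concrete, the following minimal sketch (a hypothetical illustration, not taken from any system surveyed here) filters a stream of simulated roadside sensor readings and keeps only the rare abnormal samples; the record format and the speed threshold are assumptions made purely for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Reading:
    sensor_id: str      # which roadside sensor produced the sample
    timestamp: float    # time stamp attached at acquisition
    location: tuple     # (latitude, longitude) of the device
    speed_kmh: float    # measured vehicle speed

SPEED_LIMIT_KMH = 60.0  # assumed threshold: readings above it are "abnormal"

def generate_readings(n: int) -> list[Reading]:
    """Simulate n readings; most traffic is normal, so few samples are valuable."""
    return [
        Reading("cam-17", 1700000000.0 + i, (30.52, 114.36),
                random.gauss(45.0, 10.0))
        for i in range(n)
    ]

def effective(readings: list[Reading]) -> list[Reading]:
    """Keep only the abnormal (valuable) samples."""
    return [r for r in readings if r.speed_kmh > SPEED_LIMIT_KMH]

if __name__ == "__main__":
    batch = generate_readings(10_000)
    kept = effective(batch)
    print(f"kept {len(kept)} of {len(batch)} readings "
          f"({100 * len(kept) / len(batch):.1f}% effective)")
```

Filtering this early, before transmission and storage, is one simple way to keep the small valuable portion of IoT data without shipping the full raw stream.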

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies have been developed since the beginning of the 21st century, frontier research in the bio-medicine field has also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be anticipated, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in this field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to combine them with clinical gene diagnosis and provide valuable information for the early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015, this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, information on about 13 million people has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages the electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, competing for shares in the huge market known as the "Next Internet." IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data, to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167TB to 665TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although they are in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology: GenBank is a nucleotide sequence database maintained by the US National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy: the Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, has recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004, the data volume generated per night will surpass 20TB. The last application is related to high-energy physics: in the beginning of 2008, the ATLAS experiment at the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generated raw data at 2PB/s and stored about 10TB of processed data per year.
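To give a sense of what a fixed doubling period implies, the short sketch below projects the growth of a dataset that doubles every 10 months, as stated above for GenBank; the initial size and horizon are arbitrary illustrative values.

```python
def projected_size(initial_tb: float, months: int, doubling_months: float = 10.0) -> float:
    """Size after `months`, assuming the stated doubling period."""
    return initial_tb * 2 ** (months / doubling_months)

# Example: a 1TB collection doubling every 10 months grows to ~64TB in 5 years.
print(f"{projected_size(1.0, 60):.0f}TB after 5 years")
```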

In addition, pervasive sensing and computing across nature, commerce, the Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their own unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
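As a small illustration of how compression shrinks the highly redundant data mentioned above, the following sketch (the readings and compression level are illustrative assumptions) compresses repetitive environment-monitoring records with Python's standard zlib module.

```python
import zlib

# Highly redundant environment-monitoring readings (illustrative values).
readings = ("2014-01-01T00:00:00,sensor-42,temp=21.5C\n" * 5000).encode()
compressed = zlib.compress(readings, level=6)
print(f"raw: {len(readings)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {len(readings) / len(compressed):.0f}x")
```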

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Common data collection methods are shown as follows.

– Log files: as one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in ASCII text format. Databases, rather than text files, may sometimes be used to store log information, to improve the query efficiency of the massive log store [36, 37]. There are also other forms of log-file-based data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire the related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled at different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: at present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded page, and extracts new URLs to be put into the queue. This process is repeated until the crawler is stopped (a minimal sketch of this URL-queue loop is given at the end of this subsection). Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology and zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data at the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


– Zero-copy packet capture technology: so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at the external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copying from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: at present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and more numerous means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.

In addition to the aforementioned three data acquisition methods for the main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources, and collection methods recording through other auxiliary tools.
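As a concrete illustration of the URL-queue crawling loop described earlier in this subsection, the following minimal sketch (an illustrative toy under stated assumptions, not the implementation of any system cited here) performs a breadth-first crawl with Python's standard library; the seed URL, page limit, and link-extraction regex are assumptions made for the example only.

```python
import re
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(http[^"]+)"')  # crude link extractor (illustrative)

def crawl(seed_url: str, max_pages: int = 10) -> dict[str, str]:
    """Breadth-first crawl: take a URL from the queue, download the page,
    extract new URLs, and append unseen ones to the queue."""
    queue = deque([seed_url])
    visited: set[str] = set()
    pages: dict[str, str] = {}           # URL -> downloaded HTML

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # skip unreachable pages
        pages[url] = html                 # store the page for indexing
        for link in LINK_RE.findall(html):
            if link not in visited:
                queue.append(link)        # sequence newly found URLs
    return pages

if __name__ == "__main__":
    fetched = crawl("https://example.com", max_pages=3)
    print(f"downloaded {len(fetched)} pages")
```

A production crawler would additionally respect robots.txt, parse HTML properly, and persist the queue, but the queue-download-extract loop is the same.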

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or to facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: inter-DCN transmissions are from the data source to the data center, and are generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them onto the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on the physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top-of-rack switches (TOR), and such top-of-rack switches are then connected with 10Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such a layer is constituted by 10Gbps or 100Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data

under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: the data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex of the three procedures and includes operations such as transformation, copying, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; on the contrary, it includes information or metadata related to the actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistent, and it is widely applied in many fields, such as banking, insurance, the retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, and it may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, including a lot of abnormal data, as it is limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data, so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, leading to data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion is a special data compression technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list (a minimal sketch of this mechanism is given below). As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one already in the identification list, the new data block is deemed redundant and is replaced by the corresponding stored data block. Repeated data deletion can greatly reduce the storage requirement, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such an operation plays an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
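The following minimal sketch illustrates the hash-identifier mechanism for repeated data deletion described above; the fixed block size and the use of SHA-256 are illustrative assumptions rather than choices mandated by any particular storage system.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for chunking

class DedupStore:
    """Content-addressed store: each unique block is kept once,
    and inputs are represented as lists of block identifiers."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}   # identifier -> stored block

    def put(self, data: bytes) -> list[str]:
        ids = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in self.blocks:      # new content: store it
                self.blocks[ident] = block
            ids.append(ident)                 # duplicate: only keep the reference
        return ids

    def get(self, ids: list[str]) -> bytes:
        return b"".join(self.blocks[i] for i in ids)

if __name__ == "__main__":
    store = DedupStore()
    payload = b"sensor-log-entry\n" * 10_000   # highly redundant input
    refs = store.put(payload)
    stored_bytes = sum(len(b) for b in store.blocks.values())
    print(f"{len(refs)} blocks referenced, {len(store.blocks)} stored, "
          f"dedup ratio {len(payload) / stored_bytes:.1f}x")
    assert store.get(refs) == payload
```

Real deduplicating systems typically use variable-size, content-defined chunking and persist the identification list on disk, but the identifier-lookup idea is the same.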

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On the one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for the query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue big storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment on a network. It is directly connected to the network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the perspective of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disc arrays and servers; and (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed

system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers, to ensure availability in case of server failures. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system were not seriously affected in terms of satisfying customers' requests for reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable if the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three kinds of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that they also ensure availability, while only providing eventual consistency rather than the strong consistency of the previous two kinds of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data error is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
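As a rough illustration of how an AP, Dynamo-style system trades consistency for availability, the sketch below checks quorum configurations of the form (N, R, W), where N is the number of replicas, R the read quorum, and W the write quorum: when R + W > N every read quorum overlaps the latest write quorum (stronger consistency at higher latency), otherwise reads may be stale and the system only converges eventually. The parameter values shown are illustrative assumptions, not settings prescribed by Dynamo or Cassandra.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuorumConfig:
    n: int  # number of replicas of each key
    r: int  # replicas that must answer a read
    w: int  # replicas that must acknowledge a write

    def read_overlaps_write(self) -> bool:
        """R + W > N guarantees every read quorum intersects the last write quorum."""
        return self.r + self.w > self.n

    def describe(self) -> str:
        if self.read_overlaps_write():
            return "strongly consistent reads (CP-leaning), higher latency"
        return "possibly stale reads, eventual consistency (AP-leaning), lower latency"

if __name__ == "__main__":
    # Illustrative configurations for a 3-replica store.
    for cfg in (QuorumConfig(3, 2, 2), QuorumConfig(3, 1, 1)):
        print(f"N={cfg.n} R={cfg.r} W={cfg.w}: {cfg.describe()}")
```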

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale distributed data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for big data storage. For example, HDFS and Kosmosfs are derivatives of the open source code of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.
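To give a flavor of the chunk-and-replicate idea behind GFS-like distributed file systems, the toy sketch below splits a file into fixed-size chunks and assigns each chunk to several servers; the 64MB chunk size and 3-way replication mirror commonly reported GFS defaults, while the round-robin placement policy is a simplification invented for this example.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, as commonly reported for GFS
REPLICAS = 3                   # each chunk kept on three chunkservers

def place_chunks(file_size: int, servers: list[str]) -> list[dict]:
    """Split a file into fixed-size chunks and assign each chunk to
    REPLICAS distinct servers in round-robin fashion (simplified policy)."""
    n_chunks = -(-file_size // CHUNK_SIZE)          # ceiling division
    rotation = itertools.cycle(range(len(servers)))
    placement = []
    for chunk_id in range(n_chunks):
        start = next(rotation)
        holders = [servers[(start + i) % len(servers)] for i in range(REPLICAS)]
        placement.append({"chunk": chunk_id, "servers": holders})
    return placement

if __name__ == "__main__":
    layout = place_chunks(file_size=10 * 1024**3,     # a 10GB file
                          servers=[f"cs-{i}" for i in range(5)])
    print(f"{len(layout)} chunks, first placement: {layout[0]}")
```

A real master additionally tracks server load and rack locality when placing replicas; the point here is only that large files become many independently replicated chunks.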

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support of large-volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main kinds of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are built on a simple data model, in which data is stored as key-value pairs. Every key is unique, and clients query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on Consistent Hashing [86] (a minimal sketch is given after this key-value discussion), whose main advantage is that a node joining or leaving only affects its directly adjacent nodes and does not affect other nodes, in dividing the load over multiple main storage machines. Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the updating and any other operations, the updating operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid backup.
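To make the consistent-hashing partition scheme mentioned for Dynamo more concrete, the following minimal sketch (an illustrative toy rather than Dynamo's actual implementation; the number of virtual nodes and the use of MD5 are assumptions) places servers and keys on a hash ring, so that adding or removing a server only moves the keys adjacent to it.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: a key maps to the first node clockwise."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        self.vnodes = vnodes
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):                 # virtual nodes smooth the load
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

if __name__ == "__main__":
    ring = HashRing(["server-a", "server-b", "server-c"])
    before = {k: ring.lookup(k) for k in (f"key{i}" for i in range(1000))}
    ring.add("server-d")                             # only adjacent keys move
    moved = sum(1 for k, n in before.items() if ring.lookup(k) != n)
    print(f"{moved} of {len(before)} keys moved after adding a node")
```

In a Dynamo-style system, each key would additionally be replicated to the next N - 1 distinct nodes along the ring, which ties this partition scheme to the replication parameter N mentioned above.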

– Column-oriented databases: the column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented over multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed structured data storage system, which is designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional sequenced map with sparse, distributed, and persistent storage. The indexes of the map are the row key, column key, and timestamp, and every value in the map is an unanalyzed byte array. Each row key in BigTable

is a 64KB character string By lexicograph-ical order rows are stored and continuallysegmented into Tablets (ie units of distribu-tion) for load balance Thus reading a shortrow of data can be highly effective since itonly involves communication with a small por-tion of machines The columns are groupedaccording to the prefixes of keys and thusforming column families These column fami-lies are the basic units for access control Thetimestamps are 64-bit integers to distinguishdifferent editions of cell values Clients mayflexibly determine the number of cell editionsstored These editions are sequenced in thedescending order of timestamps so the latestedition will always be read
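The sketch below imitates this data model with an in-memory Python dictionary, mapping (row key, "family:qualifier", timestamp) to a value and keeping only the latest editions; the table name, column families, and values are hypothetical, and Tablets, persistence, and access control are ignored.

```python
import time
from collections import defaultdict

class TinyBigTable:
    """Toy (row, 'family:qualifier', timestamp) -> bytes mapping, newest edition first."""
    def __init__(self, max_editions=3):
        self.max_editions = max_editions
        # row key -> column key -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1e6)
        cells = self.rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda c: c[0], reverse=True)      # descending timestamps
        del cells[self.max_editions:]                     # keep only the latest editions

    def get(self, row, column):
        cells = self.rows[row][column]
        return cells[0][1] if cells else None             # the latest edition is read first

table = TinyBigTable()
table.put("com.example.www", "anchor:cnnsi.com", b"CNN")
table.put("com.example.www", "contents:", b"<html>v1</html>")
table.put("com.example.www", "contents:", b"<html>v2</html>")
print(table.get("com.example.www", "contents:"))   # b'<html>v2</html>'
```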

The BigTable API features the creation and deletion of Tablets and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from individual columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable involves three main components: the Master server, Tablet servers, and a client library. BigTable only allows one Master server to be deployed, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balance. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, as well as deleted or disabled files, and use them in specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of the loaded Tablets. When Tablets grow too big, they will be segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally; it provides a mapping from persistent, sequenced, and unchangeable keys to values, both being arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control table.

– Cassandra: Cassandra is a distributed storage system for managing the huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are to be read or written, operations on a row are atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: Since the BigTable code is not available under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale operations. Partition and distribution are transparently operated and leave space for client hashing or fixed keys.

HyperTable was developed in a manner similar to BigTable, in order to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), and allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.

– Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries over all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files on the main nodes that support all the high-level operations conducted in the database. During copying, the slave nodes query all the writing operations since the last synchronization with the master and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.
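For illustration, the sketch below uses the standard PyMongo driver to store and query BSON documents and to create an index on a queryable field; the database name, collection name, and fields are hypothetical, and a local MongoDB instance is assumed.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")    # assumed local MongoDB instance
orders = client["shop"]["orders"]                     # hypothetical database/collection

# Documents are stored as BSON; an "_id" primary key is added automatically.
orders.insert_one({
    "customer": {"name": "Alice", "city": "Wuhan"},   # embedded object
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],  # embedded array
    "total": 35.0,
})

# Index a queryable field to enable rapid queries.
orders.create_index([("customer.city", ASCENDING)])

# JSON-like query over embedded objects and arrays.
for doc in orders.find({"customer.city": "Wuhan", "items.sku": "A1"}):
    print(doc["_id"], doc["total"])
```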


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume grows. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB assures eventual consistency but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on its replication mechanism. CouchDB supports MVCC with historical hash records.
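A minimal sketch of the read-modify-write cycle over CouchDB's RESTful HTTP API, using the Python requests library; the server URL, database name, and document fields are hypothetical, and error handling is omitted.

```python
import requests

BASE = "http://localhost:5984/articles"     # assumed CouchDB server and database

# Create a document; CouchDB returns its id and current revision.
created = requests.post(BASE, json={"title": "Big Data Survey", "views": 0}).json()
doc_id = created["id"]

# To modify a document, first download the whole document (including its _rev)...
doc = requests.get(f"{BASE}/{doc_id}").json()
doc["views"] += 1

# ...then send it back; the revision identifier changes after every rewrite (MVCC).
updated = requests.put(f"{BASE}/{doc_id}", json=doc).json()
print(created["rev"], "->", updated["rev"])
```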

Big data are generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, but this has been mitigated by some recent enhancements [96, 97].
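As a small, self-contained sketch of this model, the following Python code implements the user-programmed Map and Reduce functions for word counting and simulates the framework's shuffle step that groups intermediate values by key; a real framework such as Hadoop would distribute these calls across a cluster.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: compress the list of values for one key into a smaller set (a sum)."""
    yield word, sum(counts)

def run_mapreduce(documents):
    # Shuffle: group all intermediate values that share the same key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Reduce phase.
    return dict(pair for key, values in groups.items() for pair in reduce_fn(key, values))

docs = {1: "big data big value", 2: "big data analysis"}
print(run_mapreduce(docs))   # {'big': 3, 'data': 2, 'value': 1, 'analysis': 1}
```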

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via the data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logical operation graph are automatically mapped to physical resources.

The operational structure of Dryad is coordinated by a central program called the job manager, which can be executed in a cluster or on a workstation through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output datasets, while MapReduce supports only one input and output set.
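The sketch below illustrates the kind of job description Dryad builds: a directed acyclic graph whose vertexes are programs and whose edges are data channels, here executed locally in topological order. The vertex names and functions are hypothetical; real Dryad maps the vertexes onto cluster machines and typed channels.

```python
from graphlib import TopologicalSorter   # Python 3.9+

# Hypothetical vertex programs: each consumes the outputs of its predecessors.
def read_logs(_):        return ["a.log", "b.log"]
def parse(inputs):       return [line.upper() for files in inputs for line in files]
def aggregate(inputs):   return {"records": sum(len(x) for x in inputs)}

# DAG: vertex -> set of predecessor vertexes (edges are data channels).
graph = {"parse": {"read_logs"}, "aggregate": {"parse"}, "read_logs": set()}
programs = {"read_logs": read_logs, "parse": parse, "aggregate": aggregate}

outputs = {}
for vertex in TopologicalSorter(graph).static_order():   # schedule respecting the edges
    inputs = [outputs[p] for p in graph[vertex]]
    outputs[vertex] = programs[vertex](inputs)

print(outputs["aggregate"])   # {'records': 2}
```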


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A with all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.
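A minimal local sketch of the All-Pairs abstraction: the three-tuple (Set A, Set B, Function F) yields the output matrix M, the cross join of A and B. The example sets and the similarity function are hypothetical, and the real system partitions this matrix across a cluster.

```python
def all_pairs(set_a, set_b, f):
    """Compute the output matrix M with M[i][j] = F(A[i], B[j])."""
    return [[f(a, b) for b in set_b] for a in set_a]

# Hypothetical Function F: similarity of two fixed-length bit strings (e.g., iris codes).
def similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

A = ["1010", "1111"]
B = ["1011", "0000", "1110"]
M = all_pairs(A, B, similarity)
for row in M:
    print(row)   # first row: [0.75, 0.5, 0.75]
```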

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmissions, which enables the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the jobs in the batch processing system are completed, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., the analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable, user-defined value, and every directed edge related to a source vertex is constituted by a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify the status of itself and of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may be deactivated by suspension; when all vertexes are in an inactive status without any messages to transmit, the entire program execution is completed. The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
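To make the vertex-centric, superstep-based model concrete, the sketch below runs a tiny synchronous "find the maximum value" computation in plain Python: in each superstep every active vertex executes the same compute logic, exchanges messages along its out-edges, and votes to halt when its value stops changing. The graph and values are hypothetical, and the real system runs vertexes in parallel across machines.

```python
def pregel_max(graph, values, max_supersteps=30):
    """graph: vertex -> list of out-neighbors; values: vertex -> initial value."""
    inbox = {v: [] for v in graph}
    active = set(graph)
    for step in range(max_supersteps):
        if not active:                          # every vertex halted, no messages left
            break
        outbox = {v: [] for v in graph}
        for v in list(active):
            new_value = max([values[v]] + inbox[v])     # the same compute() on every vertex
            changed = (new_value != values[v]) or step == 0
            values[v] = new_value
            if changed:
                for neighbor in graph[v]:               # send messages along out-edges
                    outbox[neighbor].append(new_value)
            else:
                active.discard(v)                       # vote to halt
        inbox = outbox
        active |= {v for v, msgs in inbox.items() if msgs}   # messages wake vertexes up
    return values

g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(pregel_max(g, {"a": 3, "b": 6, "c": 2}))   # {'a': 6, 'b': 6, 'c': 6}
```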

Inspired by the above programming models, other researchers have also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making based on data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of the data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data (a small k-means sketch is given after this list).

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, so that the few factors reveal most of the information of the original data.



– Correlation Analysis: an analytical method for determining the law of relations among observed phenomena, such as correlation, correlative dependence, and mutual restriction, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, an undetermined or inexact dependence relation, in which the numerical value of one variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into something simple and regular.

– A/B Testing: also called bucket testing, a technology for determining how to improve target variables by comparing tested groups. Big data requires a large number of tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics in which randomness and uncertainty are modeled with probability theory. Statistical analysis can provide both a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
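As a concrete example of the cluster analysis mentioned above, the following is a minimal k-means sketch in pure Python; the toy two-dimensional points and the value of k are hypothetical, and library implementations (e.g., scikit-learn) would normally be used in practice.

```python
import random

def kmeans(points, k, iterations=100):
    """Unsupervised grouping: assign each point to its nearest centroid, then re-center."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9), (0.8, 1.2), (7.9, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)   # two centroids, near (0.9, 1.1) and (8.0, 8.0)
```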

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: a Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store Hash values of the data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages, such as misrecognition (false positives) and no support for deletion. A small sketch is given after this list.

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound Hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of the Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
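The sketch below illustrates the Bloom Filter described above: k hash functions set bits in a bit array on insertion, membership tests may return false positives but never false negatives, and elements cannot be deleted. The sizes chosen are arbitrary, for illustration only.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k            # m bits, k hash functions
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        for i in range(self.k):          # derive k hash values from salted digests
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add("10.0.0.1")
print(bf.might_contain("10.0.0.1"), bf.might_contain("10.0.0.2"))   # True, (almost surely) False
```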

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment: MPI – computing nodes and data storage arranged separately (data should be moved to computing nodes); MapReduce – computing and data storage arranged at the same node (computing should be close to data); Dryad – computing and data storage arranged at the same node (computing should be close to data).

Resource management / scheduling: MPI – not provided; MapReduce – Workqueue (Google), HOD (Yahoo); Dryad – not clear.

Low-level programming: MPI – MPI API; MapReduce – MapReduce API; Dryad – Dryad API.

High-level programming: MPI – not provided; MapReduce – Pig, Hive, Jaql, etc.; Dryad – Scope, DryadLINQ.

Data storage: MPI – the local file system, NFS, etc.; MapReduce – GFS (Google), HDFS (Hadoop), Amazon S3, etc.; Dryad – NTFS, KFS, Cosmos DFS.

Task partitioning: MPI – the user manually partitions the tasks; MapReduce – automatic; Dryad – automatic.

Communication: MPI – messaging, remote memory access; MapReduce – files (local FS, DFS); Dryad – files, TCP pipes, shared-memory FIFOs.

Fault tolerance: MPI – checkpoint; MapReduce – task re-execution; Dryad – task re-execution.

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in E-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop, in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. Currently, mainstream BI products are provided with data analysis plans supporting data scales over the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software tools, according to a survey of "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" of 798 professionals conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked top 1 in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year," R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers, such as Teradata and Oracle, have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. The functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed a production line of a factory, with the original data as input and the model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated into Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have online displays and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, such as astrophysics, oceanology, genomics, and environmental research. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, interoperable analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have comprehensive coverage, we will focus on the key problems and technologies of data analysis in the following discussions.



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning, based on exact mathematical models and powerful algorithms, has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127].

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and useful knowledge is extracted and the semantics understood through analysis. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia suggestion, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the results of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories, so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is optimized through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of the content they are interested in, and recommend other content with similar features to users. These methods rely largely on content similarity measurement, but most of them are troubled by limited analysis and excessive specification. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method that integrates the advantages of the aforementioned two types of methods has been introduced to improve recommendation quality [133].
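For illustration, a minimal user-based collaborative-filtering sketch in Python: users with similar rating behavior are identified by cosine similarity, and unseen items favored by similar users are recommended. The rating matrix and user names are hypothetical.

```python
import math

ratings = {                                  # hypothetical user -> {item: rating}
    "u1": {"video_a": 5, "video_b": 4, "video_c": 1},
    "u2": {"video_a": 4, "video_b": 5, "video_d": 4},
    "u3": {"video_c": 5, "video_d": 2},
}

def cosine(r1, r2):
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    return dot / (math.sqrt(sum(v * v for v in r1.values())) *
                  math.sqrt(sum(v * v for v in r2.values())))

def recommend(user, k=2):
    # Score unseen items by the similarity-weighted ratings of the most similar users.
    neighbors = sorted(((cosine(ratings[user], ratings[u]), u)
                        for u in ratings if u != user), reverse=True)[:k]
    scores = {}
    for sim, u in neighbors:
        for item, r in ratings[u].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("u1"))   # ['video_d'], favored by the most similar user u2
```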

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers to predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to a singular, similar matrix [141]. A community is represented by a subgraph, in which edges connecting vertexes within the subgraph feature high density, while edges between two subgraphs feature much lower density [142].
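As a toy illustration of link prediction on such a graph, the following Python sketch scores non-connected vertex pairs with the common-neighbors heuristic, one of the simplest feature-based signals; the example friendship graph is hypothetical.

```python
from itertools import combinations

# Hypothetical undirected SNS graph: user -> set of friends.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave": {"bob"},
}

def common_neighbor_scores(g):
    """Score every unconnected pair by the number of neighbors it shares."""
    scores = {}
    for u, v in combinations(g, 2):
        if v not in g[u]:
            scores[(u, v)] = len(g[u] & g[v])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(common_neighbor_scores(graph))
# [(('alice', 'dave'), 1), (('carol', 'dave'), 1)] -- both pairs share bob as a neighbor
```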

Many methods for community detection have been proposed and studied, most of which are topology-based, with target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144-146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise: for example, the blogosphere contains a large number of spam blogs, and so does Twitter with trivial Tweets. Third, SNS are dynamic networks, which frequently and quickly vary and are updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time health monitoring. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the security system [11]. In the meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, regarding marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. Regarding sales planning, after comparison of massive data, enterprises can optimize their commodity prices. Regarding operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. Regarding the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective in attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises from the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project based on cooperation between the Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year, due to timely identification and fixing of water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals, based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, microblogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire value.

– Content-based applications: Language and text are the two most important forms of representation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis, as sketched below.
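A minimal, hedged sketch of community detection follows, assuming the NetworkX library and an invented toy friendship graph; the survey itself does not prescribe a particular algorithm, and greedy modularity maximization is used here only as one common choice.

```python
# Hedged sketch: community detection on a small toy friendship graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
# Hypothetical "who talks to whom" edges: two tight groups joined by one link.
G.add_edges_from([
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),   # group 1
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),   # group 2
    ("cat", "dan"),                                   # weak bridge
])

# Each returned set of nodes is one detected community.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")
```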

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old; finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
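One simple way to implement aspect 1) above, detecting sharp growth or drops in topic volume, is a trailing z-score test over daily counts. The sketch below uses invented counts and an arbitrary threshold; Global Pulse's actual methodology is not described in the survey.

```python
# Hedged sketch: flag abnormal events as sharp deviations in the daily
# volume of a topic (e.g. tweets mentioning rice prices). Counts are invented.
import numpy as np

daily_counts = np.array([120, 130, 118, 125, 140, 122, 131, 128, 410, 390, 135, 129])
window = 7  # trailing baseline length in days

for t in range(window, len(daily_counts)):
    baseline = daily_counts[t - window:t]
    mu, sigma = baseline.mean(), baseline.std() + 1e-9
    z = (daily_counts[t] - mu) / sigma
    if abs(z) > 3:  # arbitrary spike threshold
        print(f"day {t}: count {daily_counts[t]} deviates sharply (z = {z:.1f})")
```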

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and to master the laws of social and economic activities, from the following three aspects:

– Early warning: rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time monitoring: provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time feedback: acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment, in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of metabolic syndrome detection test results of patients over three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % over the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglycerides in their bodies if their blood sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies from Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

Microsoft's HealthVault, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information across individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through its software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as its foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not, or do not want to, accomplish. With no need to intentionally deploy sensing modules or employ professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operational framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
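The hedged sketch below mimics the workflow just described: a requester posts a location-bound task and the nearest willing mobile user is selected to go there and collect the data. All worker identifiers and coordinates are invented for illustration.

```python
# Hedged sketch of spatial-crowdsourcing task assignment by proximity.
import math

def distance_km(a, b):
    # Approximate distance in km between two (lat, lon) points.
    x = math.radians(b[1] - a[1]) * math.cos(math.radians((a[0] + b[0]) / 2))
    y = math.radians(b[0] - a[0])
    return 6371.0 * math.hypot(x, y)

workers = {               # mobile users willing to participate, with positions
    "u1": (31.230, 121.474),
    "u2": (31.240, 121.490),
    "u3": (31.210, 121.450),
}

def assign(task_location):
    # Pick the closest available worker; they travel to the spot, capture
    # video/audio/pictures, and send the data back to the requester.
    worker = min(workers, key=lambda w: distance_km(workers[w], task_location))
    return worker, distance_km(workers[worker], task_location)

task = (31.235, 121.480)  # location specified by the service requester
who, km = assign(task)
print(f"task assigned to {who}, {km:.2f} km away")
```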

6.3.6 Smart grid

Smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control, for optimized generation, supply, and consumption of electric energy. Smart grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the smart grid, regions with excessively high electrical load or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory and made a California map by integrating census information with real-time power utilization information provided by electric power companies. The map takes a block as a unit to show the current power consumption of every block. It can even compare the power consumption of a block with the average per capita income and building types, so as to reveal more accurate power usage habits of different groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed on the map, may be conducted.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand for power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy has used such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win for both energy suppliers and users (a minimal pricing sketch follows this list).

– Access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
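As referenced above, the following hedged sketch derives a simple peak/off-peak tariff from 15-minute smart meter readings. The load curve and the two price levels are invented for illustration and do not reflect TXU Energy's actual tariff design.

```python
# Hedged sketch: a toy time-of-use tariff from 15-minute smart meter data.
import numpy as np

# One day of aggregate load, 96 readings (every 15 minutes), in MW (assumed).
hours = np.arange(96) / 4.0
load = 50 + 30 * np.exp(-((hours - 19) ** 2) / 8) + 5 * np.exp(-((hours - 8) ** 2) / 6)

peak_threshold = np.quantile(load, 0.75)              # top quarter of intervals = "peak"
price = np.where(load >= peak_threshold, 0.18, 0.09)  # $/kWh, assumed price levels

for start in (8 * 4, 19 * 4):                         # sample an 08:00 and a 19:00 interval
    print(f"{int(hours[start]):02d}:00  load={load[start]:5.1f} MW  price=${price[start]:.2f}/kWh")
```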

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research, because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many big data solutions claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after a system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions before and after implementation. In addition, since data quality is an important basis for data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: These include the in-memory mode, data flow mode, PRAM mode, and MR (MapReduce) mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck of big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data (a small illustrative decay sketch follows this list).

– Processing of big data: As big data research advances, new problems in big data processing arise beyond traditional data analysis, including (i) data re-utilization: as the data scale increases, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets from different businesses can be re-organized and then mined for more value; and (iii) data exhaust, which refers to incorrect data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
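As referenced in the real-time performance item above, one possible, purely illustrative way to model the rate of depreciation of data is an exponential decay of value with age; the half-life below is an arbitrary assumption, not a result from the survey.

```python
# Hedged sketch: assumed exponential depreciation of a record's value with age.
def current_value(initial_value, age_days, half_life_days=30.0):
    # value(t) = v0 * 2^(-t / half_life); half_life is application-specific.
    return initial_value * 2 ** (-age_days / half_life_days)

for age in (0, 7, 30, 90):
    print(f"age {age:3d} d -> relative value {current_value(1.0, age):.2f}")
```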

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved:

– Big data management: The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data applications: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed the big data company with the most SNS data at present. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanisms: Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions against possible events in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of sciences: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly way may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it can be observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers a revolution in thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we are willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– Simple algorithms on big data are more effective than complex algorithms on small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2 Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3 Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4 Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011) http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9 Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16 O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17 Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82 McKusick MK Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G Lakshman A Pilchin A Sivasubramanian S Vosshall P Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC Lehnardt J Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics, data mining, big data software you used in the past 12 months for a real project (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special report on personal technology (2011)

116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences


Page 10: Big Data: A Survey Min Chen

and traffic accidents are more valuable than those only capturing the normal flow of traffic.

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, frontier research in the bio-medicine field has also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in this field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to be combined with clinical gene diagnosis and to provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, competing for shares in the huge market known as the "Next Internet." IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that by 2015 the average data volume of every hospital will increase from 167 TB to 665 TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, has recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope improved, by 2004 the data volume generated per night surpassed 20 TB. The last application is related to high-energy physics. In the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generated raw data at 2 PB/s and stored about 10 TB of processed data per year.

In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases other than text files may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log files based on data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: Sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habit monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled at different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of a web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded web page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped (a minimal sketch of this crawling loop is given at the end of this subsection). Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


– Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.

In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
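To make the crawling loop described above concrete, the following is a minimal breadth-first sketch; it is not tied to any system surveyed here, and the seed URL and page limit are illustrative assumptions.

# Minimal breadth-first web crawler sketch (illustrative only; the seed URL
# and page limit are assumptions, not part of the surveyed systems).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Fetch pages starting from seed_url, following links in FIFO order."""
    queue, seen, pages = deque([seed_url]), {seed_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                      # skip unreachable pages
        pages[url] = html                 # store the downloaded page
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:         # extract new URLs and enqueue them
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

if __name__ == "__main__":
    fetched = crawl("https://example.com", max_pages=5)
    print(len(fetched), "pages stored")

A production crawler would additionally respect robots.txt, rate-limit requests, and persist the URL queue, none of which is shown here.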

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructures in most regions around the world are constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54] (a small illustration of canonical fat-tree capacity follows this item). In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (ToR), and then such top-of-rack switches are connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such layer is constituted by 10 Gbps or 100 Gbps core switches to connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches, using low-cost multi-mode fiber (MMF) with 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
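As an aside on the fat-tree structures mentioned above, the canonical k-ary fat-tree (a standard construction whose details are not given in this survey) connects k pods, each with k/2 edge and k/2 aggregation switches, through (k/2)^2 core switches, and supports k^3/4 hosts. The small sketch below simply evaluates these standard counts; the 48-port switch size is an illustrative assumption.

# Capacity of a canonical k-ary fat-tree (standard formulas; the port count
# k = 48 is an illustrative assumption, not a figure from the survey).
def fat_tree_capacity(k: int) -> dict:
    """Return switch and host counts for a k-ary fat-tree (k must be even)."""
    assert k % 2 == 0, "port count k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 edge switches per pod
        "aggregation_switches": k * (k // 2),  # k/2 aggregation switches per pod
        "core_switches": (k // 2) ** 2,
        "hosts": (k ** 3) // 4,                # each edge switch serves k/2 hosts
    }

print(fat_tree_capacity(48))   # 48-port switches -> 27,648 hosts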

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have serious requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: Data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting source systems, selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; on the contrary, it includes information or metadata related to the actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning: Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors (a small rule-based sketch of these checks appears after this item). During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance to keep data consistency, and is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, which includes a lot of abnormal data, limited by the physical design and affected by environmental noises. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.
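As a concrete illustration of the rule-based checks described above, the sketch below flags records whose fields are missing or out of range and documents the error type for each; the schema, value ranges, and sample records are assumptions, not data from any of the cited systems.

# Rule-based record cleaning sketch (illustrative; the schema, value ranges,
# and sample records are assumptions).
RULES = {
    "age":   lambda v: v is not None and 0 <= v <= 120,
    "email": lambda v: v is not None and "@" in v,
}

def clean(records):
    """Split records into clean ones and documented errors."""
    kept, errors = [], []
    for rec in records:
        bad_fields = [f for f, ok in RULES.items() if not ok(rec.get(f))]
        if bad_fields:
            # Document the error example and its type instead of silently dropping it.
            errors.append({"record": rec, "error_types": bad_fields})
        else:
            kept.append(rec)
    return kept, errors

records = [
    {"age": 34, "email": "alice@example.com"},
    {"age": 240, "email": "bob@example.com"},    # unreasonable value
    {"age": 28, "email": None},                  # incomplete record
]
kept, errors = clean(records)
print(len(kept), "clean records,", len(errors), "flagged")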

– Redundancy elimination: Data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase the unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, leading to data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block (a minimal sketch of this block-level deduplication is given at the end of this subsection). Repeated data deletion can greatly reduce the storage requirement, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific feature problem, the performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
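The following is a minimal sketch of the block-level deduplication idea just described, using a cryptographic hash as the block identifier; the block size and the in-memory identifier list are illustrative assumptions rather than a description of any surveyed system.

# Block-level deduplication sketch (illustrative; block size and the
# in-memory identifier list are assumptions).
import hashlib

BLOCK_SIZE = 4096          # bytes per data block

def deduplicate(stream):
    """Store each distinct block once; repeated blocks become references."""
    store = {}             # identifier -> block content (the "identification list")
    layout = []            # sequence of identifiers reconstructing the stream
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        ident = hashlib.sha256(block).hexdigest()
        if ident not in store:          # new block: keep one physical copy
            store[ident] = block
        layout.append(ident)            # duplicate blocks only add a reference
    return store, layout

if __name__ == "__main__":
    import io
    data = io.BytesIO(b"A" * 8192 + b"B" * 4096 + b"A" * 4096)
    store, layout = deduplicate(data)
    print(len(layout), "blocks referenced,", len(store), "blocks stored")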

4 Big data storage

The explosive growth of data imposes more strict requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices were used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue big capacity of storage to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems emerge to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage is to utilize the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized with strong expandability.

NAS is actually an auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the perspective of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network sub-systems, which provide connection among one or more disc arrays and servers; (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks for multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: A distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: A distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected in terms of satisfying customers' requests for reading and writing. This property is called availability.

– Partition tolerance: Multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they could not handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as the traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems could not handle network failures, they could not be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP could not ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that AP systems also ensure availability. However, AP systems only ensure eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

The database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy copy, simple APIs, eventual consistency, and support of large-volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: Key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may input queried values according to the keys. Such databases feature a simple structure, and the modern key-value databases are characterized with high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, as motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through the data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on Consistent Hashing [86], whose main advantage is that node passing only affects directly adjacent nodes and does not affect other nodes, to divide the load among multiple main storage machines (a minimal hash-ring sketch is given at the end of this key-value discussion). Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the updating and any other operations, the updating operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. Notably, Voldemort supports two storage engines, including Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon Dynamo DB, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid the need for backup.
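To make the consistent hashing idea used by Dynamo-style systems concrete, here is a minimal hash-ring sketch. It is a generic illustration: the node names and the MD5 choice are assumptions, and virtual nodes and replication, which real systems add, are omitted; it is not Dynamo's actual implementation.

# Minimal consistent-hashing ring (illustrative; real systems such as Dynamo
# add virtual nodes and replication, which are omitted here).
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes):
        # Place each node at a ring position derived from the hash of its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key: str) -> str:
        """Return the first node clockwise from the key's ring position."""
        idx = bisect.bisect(self.positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for k in ["user:42", "cart:7", "session:x"]:
    print(k, "->", ring.lookup(k))
# Adding or removing one node only remaps the keys adjacent to it on the ring.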

– Column-oriented databases: The column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional, sequenced mapping with sparse, distributed, and persistent storage. The indexes of the mapping are row key, column key, and timestamp, and every value in the mapping is an unanalyzed byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers to distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in the descending order of timestamps, so the latest edition will always be read first.

The BigTable API features the creation and deletion of tablets and column families, as well as modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable includes three main components: the master server, tablet servers, and the client library. BigTable allows only one master server to be deployed, which is responsible for distributing tablets to tablet servers, detecting added or removed tablet servers, and conducting load balancing. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage among the files saved in GFS, i.e., deleted or disabled files used by specific BigTable instances. Every tablet server manages a tablet set and is responsible for the reading and writing of the loaded tablets. When tablets are too big, they will be segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a mapping between persistent, sequenced, and unchangeable keys and values as arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensure there is at most one active master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up tablet servers; 4) conduct error recovery in case of tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions include row, column, column family, and super column. A row is distinguished by a string key with arbitrary length. No matter how many columns are to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: Since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. The row operations are atomic operations, equipped with row-level locking and transaction processing, which is optional for large scale operations. Partition and distribution are operated transparently, with space for client hash or fixed keys.

Hypertable was developed similarly to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. Hypertable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. The data representation, processing, and partition mechanism are similar to those in BigTable. Hypertable has its own query language, called Hypertable Query Language (HQL), and allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and Hypertable focus on strong consistency through locks or log records.

– Document databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict modes, there is no need to conduct mode migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON; a database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays (a brief illustrative snippet follows the document database discussion below). To enable rapid query, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files in the main nodes that record all the high-level operations conducted in the database. During copying, the slave nodes query all the writing operations since their last synchronization with the master and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and is a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partition and thus could not be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein could not be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through the RESTful HTTP API. If a document needs to be modified, the client must download the entire document to modify it, and then send it back to the database. After a document is rewritten once, the identifier will be updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the copying mechanism. CouchDB supports MVCC with historical Hash records.
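As a brief illustration of the document model discussed above, the snippet below uses MongoDB's Python driver to insert and query JSON-like documents; the connection string, database name, and field names are assumptions, and only basic, well-known driver calls are exercised.

# Document-store usage sketch with MongoDB's Python driver (pymongo).
# The connection URI, database name, and document fields are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
patients = client["demo_db"]["patients"]

# Documents are schema-free JSON-like objects; embedded objects are allowed.
patients.insert_one({"name": "Alice", "age": 34,
                     "visits": [{"year": 2013, "diagnosis": "flu"}]})

# Index a queryable field, then query on it (and on an embedded field).
patients.create_index("age")
for doc in patients.find({"age": {"$gte": 30}, "visits.year": 2013}):
    print(doc["name"], doc["age"])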

Big data is generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap relative to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set (a word-count sketch of the two functions is given after this item). MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
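The canonical word-count example below illustrates the two user-written functions described above; the single-process driver that groups intermediate pairs is only a stand-in for the distributed runtime a real MapReduce framework such as Hadoop would provide.

# Word count in the MapReduce style (illustrative; the in-memory grouping
# replaces the distributed shuffle that a real framework would provide).
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: compress the list of counts for one key into a single value."""
    return word, sum(counts)

def run_job(documents):
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():          # map phase
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)         # group values by key
    return dict(reduce_fn(k, v) for k, v in intermediate.items())  # reduce phase

docs = {"d1": "big data needs big storage", "d2": "big data analysis"}
print(run_job(docs))   # {'big': 3, 'data': 2, ...}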

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via the data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed on clusters or workstations through the network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making, and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output data sets, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B (a minimal sketch follows this item).

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance will be built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which makes the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine will build a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine will collect results and combine them in a proper structure, which is generally a single file list in which all results are put in order.
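The essence of the (Set A, Set B, Function F) abstraction above is the cross join sketched below; the similarity function and the tiny input sets are illustrative assumptions, and the real system distributes this work across a cluster rather than running it in one process.

# All-Pairs style cross join (illustrative; the comparison function F and the
# input sets are assumptions, and no distribution across nodes is shown).
def all_pairs(set_a, set_b, f):
    """Return the output matrix M with M[i][j] = f(set_a[i], set_b[j])."""
    return [[f(a, b) for b in set_b] for a in set_a]

def overlap(a, b):
    """A toy similarity function standing in for, e.g., a biometric comparison."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

A = ["ACGT", "TTGA"]
B = ["ACGG", "CCTA", "TTGA"]
M = all_pairs(A, B, overlap)
for row in M:
    print([round(x, 2) for x in row])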

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable and user-defined value, and every directed edge related to a source vertex is constituted by the user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is finished. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by suspension (voting to halt). When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed.

The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs (a small vertex-centric sketch in this style closes this subsection).

Inspired by the above programming models, other research has also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
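To close this subsection, here is a minimal vertex-centric sketch in the Pregel style (single-process, synchronous supersteps); the graph, the label-propagation logic, and the halting rule are illustrative assumptions rather than Google's implementation.

# Vertex-centric supersteps in the Pregel style (illustrative, single process).
# Each vertex keeps a value, receives messages from the previous superstep,
# and sends messages along its out-edges; it goes inactive when its value
# stops changing. This toy program propagates the minimum vertex id through
# each connected component of an assumed undirected graph.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}   # adjacency lists (assumed)
value = {v: v for v in graph}                          # user-defined vertex value
messages = {v: [] for v in graph}
active = set(graph)

superstep = 0
while active:
    new_messages = {v: [] for v in graph}
    next_active = set()
    for v in graph:
        new_value = min([value[v]] + messages[v])      # same user function at every vertex
        if new_value < value[v] or superstep == 0:
            value[v] = new_value
            for target in graph[v]:                    # send messages to neighbors
                new_messages[target].append(new_value)
            next_active.add(v)                         # stays active while still changing
    messages, active = new_messages, next_active       # global synchronization point
    superstep += 1

print(value)   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}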

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category will have high homogeneity while different categories will have high heterogeneity. Cluster analysis is an unsupervised study method that requires no training data.

– Factor Analysis is basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into a factor, and the few factors are then used to reveal most of the information of the original data.




– Correlation Analysis is an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, an undetermined or inexact dependence relation, in which the numerical value of a variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis is a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis can reduce complex and undetermined correlations among variables to something simple and regular.

– A/B Testing, also called bucket testing, is a technique for determining how to improve target variables by comparing tested groups. Big data requires a large number of tests to be executed and analyzed.

– Statistical Analysis. Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide both a description of and an inference on big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms. Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research.
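As an illustration of the cluster analysis method mentioned in the list above, the following is a minimal k-means sketch in Python using only numpy. It is a didactic example under simplifying assumptions (random initialization, Euclidean distance), not a production implementation.

```python
# Minimal k-means clustering sketch (illustrative only).
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # assign each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(data, k=2)
print(centers)
```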

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter. A Bloom Filter consists of a series of hash functions. The principle of the Bloom Filter is to store hash values of the data rather than the data itself, using a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has the advantages of high space efficiency and high query speed, but also the drawbacks of possible misrecognition (false positives) and no support for deletion (a minimal sketch appears after this list).

– Hashing is a method that essentially transforms data into shorter, fixed-length numerical or index values. Hashing has the advantages of rapid reading and writing and high query speed, but a sound hash function is hard to find.

– Index. An index is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when the data is updated.

– Trie, also called a trie tree, is a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons of character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing. Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-tasks and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
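To make the Bloom Filter idea from the list above concrete, the following is a minimal sketch in Python. The number of bits, number of hash functions, and the use of SHA-256 with a per-function prefix are illustrative assumptions only, not a recommended configuration.

```python
# Minimal Bloom filter sketch: stores hash values in a bit array instead of the
# data itself; queries may yield false positives but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, item):
        # derive several hash positions by prefixing the item with an index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("big data")
print("big data" in bf)   # True
print("hadoop" in bf)     # False (with high probability)
```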

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages have been developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, as well as Scope and DryadLINQ for Dryad. A minimal word-count sketch following the MapReduce model is shown below.
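The following single-machine Python sketch only mimics the map, shuffle, and reduce phases of the MapReduce programming model on a toy word-count task; it is not tied to any particular engine such as Hadoop, and the function names are illustrative.

```python
# Minimal word-count sketch following the MapReduce model:
# map -> shuffle (group by key) -> reduce. Single machine, illustrative only.
from collections import defaultdict

def map_phase(document):
    # emit (word, 1) pairs for every word in the document
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # sum the counts for each word
    return key, sum(values)

documents = ["Big data is big", "data analysis of big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
reduced = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(reduced)   # {'big': 3, 'data': 3, 'is': 1, 'analysis': 1, 'of': 1}
```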

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures need to be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

                           MPI                             MapReduce                       Dryad
Deployment                 Computing node and data         Computing and data storage      Computing and data storage
                           storage arranged separately     arranged at the same node       arranged at the same node
                           (data should be moved to        (computing should be            (computing should be
                           the computing node)             close to data)                  close to data)
Resource management /      –                               Workqueue (Google),             Not clear
scheduling                                                 HOD (Yahoo)
Low-level programming      MPI API                         MapReduce API                   Dryad API
High-level programming     –                               Pig, Hive, Jaql, ...            Scope, DryadLINQ
Data storage               The local file system,          GFS (Google),                   NTFS,
                           NFS, ...                        HDFS (Hadoop),                  KFS, Cosmos DFS
                                                           Amazon S3, ...
Task partitioning          User manually partitions        Automatic                       Automatic
                           the tasks
Communication              Messaging, remote               Files (local FS, DFS)           Files, TCP pipes,
                           memory access                                                   shared-memory FIFOs
Fault tolerance            Checkpoint                      Task re-execution               Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since the data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. The currently mainstream BI products provide data analysis plans that support data above the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the kinds of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of 798 professionals, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?", conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, or Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and plotting. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was more frequently used than R (ranking first that year). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes consisting of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly add various nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, along with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications. The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems, prevailing in the 1990s, were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online display and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications. The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform of interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, a plenitude of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, provide users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications. Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high varieties in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have comprehensive coverage, we focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning, based on exact mathematical models and powerful algorithms, has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawlers are another successful case of utilizing such models [127].
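To illustrate the link-analysis idea behind such models, the following is a minimal power-iteration sketch of PageRank on a tiny link graph in Python. The damping factor, tolerance, and the toy adjacency matrix are illustrative assumptions; real search engines use far more elaborate variants.

```python
# Minimal power-iteration sketch of the PageRank idea (illustrative only).
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-8, max_iter=100):
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=1)
    # column-stochastic transition matrix; dangling pages link to every page
    transition = np.where(out_degree[:, None] > 0,
                          adjacency / np.maximum(out_degree[:, None], 1),
                          1.0 / n).T
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * transition @ rank
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# toy graph: page 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)
print(pagerank(links))
```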

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is needed to extract useful knowledge and understand the semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information, and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the results of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories, so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited analysis and over-specification. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
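As a concrete illustration of the collaborative-filtering idea described above, the following is a minimal user-based sketch in Python that scores unseen items by a similarity-weighted average of other users' ratings. The toy rating matrix and the use of cosine similarity are illustrative assumptions, not a description of any deployed system.

```python
# Minimal user-based collaborative-filtering sketch (illustrative only).
import numpy as np

# rows = users, columns = multimedia items, entries = ratings (0 = unrated)
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return u @ v / denom if denom else 0.0

def recommend(user, top_n=1):
    # similarity of the target user to every other user
    sims = np.array([cosine(ratings[user], ratings[other]) if other != user else 0.0
                     for other in range(len(ratings))])
    # predicted score per item: similarity-weighted average of other users' ratings
    predicted = sims @ ratings / (sims.sum() or 1.0)
    unseen = np.where(ratings[user] == 0)[0]
    return unseen[np.argsort(predicted[unseen])[::-1]][:top_n]

print(recommend(user=0))   # unrated items for user 0, ranked by predicted interest
```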

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis.


In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers to predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
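To make the link-prediction idea above concrete, the following Python sketch scores unconnected user pairs by their number of common neighbors, one of the simplest structural features that a feature-based classifier might use. The toy friendship graph and user names are hypothetical.

```python
# Minimal link-prediction sketch: rank non-connected pairs by common neighbors.
from itertools import combinations

friends = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol"},
    "carol": {"alice", "bob", "erin"},
    "dave":  {"alice"},
    "erin":  {"carol"},
}

def common_neighbors(u, v):
    # number of users connected to both u and v
    return len(friends[u] & friends[v])

candidates = [(u, v) for u, v in combinations(friends, 2) if v not in friends[u]]
scores = sorted(((common_neighbors(u, v), u, v) for u, v in candidates), reverse=True)
for score, u, v in scores:
    print(f"{u} - {v}: {score} common neighbors")
```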

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on a target function that captures the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144-146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just starting, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and the improved performance of mobile devices, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and such communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time health monitoring. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chain management, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can become aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project built on the cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year, due to timely identification and fixing of water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network.


Big data of online SNS mainly comes from instant messages, online social interactions, micro blogs, and shared spaces, etc., and represents various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, by virtue of theories and methods from mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications. Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications. In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.

Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final results into extremely personalized treatment plans to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through a software development kit (SDK) and an open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation



6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had already been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not, or would not want to, accomplish. With no need to intentionally deploy sensing modules or employ professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows: a user requests services and resources related to a specified location; then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecasted that spatial crowdsourcing will be more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.

6.3.6 Smart grid

Smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-grid-related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by the phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges for exploiting big data.

– Grid planning. By analyzing data in the smart grid, regions that have excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at a given moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption. An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission, transformation, distribution, and consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has successfully deployed smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced. Moreover, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levers to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy. At present, much new energy, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.

7 Conclusion, open issues, and outlook

In this paper, we reviewed the background and state-of-the-art of big data. First, we introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and a big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data. There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data. An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions for big data applications claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to compare the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once a system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions before and after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes. This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data. Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data. As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust, which refers to the wrong data generated during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management. The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data. As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate provenance information featuring different standards and coming from different datasets.

– Big data application. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy. Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data from the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism. Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security. Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and the related hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We could not predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by the globally-distributed database Spanner of Google and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented. It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it could not replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 1
6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data
7 Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc
91 Judd D (2008) Hypertable-0.9.0.4-alpha
92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc
93 Crockford D (2006) The application/json media type for javascript object notation (json)
94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc
95 Anderson JC Lehnardt J Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc
96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special Report on Personal Technology (2011)
116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

3.2.1 Data collection

Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are described as follows.

– Log files. As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also other data collection methods based on log files, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing. Sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data. At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded web page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped (a minimal crawler sketch is given at the end of this subsection). Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology. Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used. (A bare-bones capture loop illustrating this style of packet acquisition is sketched after this list.)


– Zero-copy packet capture technology. The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

– Mobile equipment. At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition, as well as more variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy". It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
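To make the capture loop mentioned in the Libpcap item above concrete, the following is a minimal sketch of packet acquisition using Linux raw sockets from the Python standard library. It illustrates what a capture library does rather than Libpcap itself; the protocol constant and the Ethernet field layout are assumptions of this sketch, and root privileges on Linux are required.

```python
import socket
import struct

# ETH_P_ALL (0x0003) asks the kernel for frames of every protocol.
# AF_PACKET raw sockets are Linux-specific and require root privileges.
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))

for _ in range(10):                      # capture a handful of frames, then stop
    frame, meta = sock.recvfrom(65535)   # one link-layer frame per call
    # First 14 bytes of an Ethernet frame: destination MAC, source MAC, EtherType.
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    print(meta[0], hex(ethertype), len(frame), "bytes")
```

Note that every recvfrom() call copies a frame from kernel space into the user buffer; this per-packet copy and system call overhead is exactly what the zero-copy techniques described above aim to remove.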

In addition to the aforementioned three data acquisition methods for the main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
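As a concrete illustration of the crawler procedure described above (URL queue, page download, link extraction, enqueueing of new URLs), below is a minimal breadth-first crawler sketch built only on the Python standard library. The function and parameter names are our own; a production crawler would additionally honor robots.txt, apply politeness delays, and fetch pages in parallel.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50):
    queue, seen, pages = deque([seed_url]), {seed_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()                      # take URLs in order of arrival
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                               # skip unreachable pages
        pages[url] = html                          # store the downloaded page
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:                  # extract new URLs and enqueue them
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

A call such as crawl("http://example.com", max_pages=10) repeats the download-extract-enqueue cycle until the page budget is exhausted, mirroring the loop described in the text.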

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions. Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as the IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or TB/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectra to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions. Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TOR), and such top-of-rack switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such a layer is constituted by 10 Gbps or 100 Gbps core switches to connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches, using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans were proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
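To give a feel for the scale of the fat-tree structures mentioned above, the following toy calculation uses the classic k-ary fat-tree construction (k pods, k/2 edge and k/2 aggregation switches per pod, and (k/2)^2 core switches). This is an illustrative assumption of this sketch, not a description of any particular data center cited here.

```python
def fat_tree_capacity(k: int) -> dict:
    """Switch and host counts of a classic k-ary fat-tree (k must be even)."""
    edge = k * (k // 2)            # k pods, k/2 edge switches per pod
    aggregation = k * (k // 2)     # k/2 aggregation switches per pod
    core = (k // 2) ** 2
    hosts = edge * (k // 2)        # each edge switch serves k/2 hosts
    return {"edge": edge, "aggregation": aggregation, "core": core, "hosts": hosts}

# With commodity 48-port switches, a single fat-tree can reach ~27k hosts.
print(fat_tree_capacity(48))  # {'edge': 1152, 'aggregation': 1152, 'core': 576, 'hosts': 27648}
```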

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have serious requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense, but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration. Data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. On the contrary, it includes information or metadata related to the actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning. Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance to keep data consistent, and it is widely applied in many fields, such as banking, insurance, the retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected and may have serious data quality problems. Classic data quality problems mainly come from software defects, customization errors, or system misconfiguration. The authors in [69] discussed data cleaning in e-commerce by crawlers and by regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, original RFID data is of low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors in input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects; for example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well-known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion (deduplication) is a special data compression technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. If a new data block has an identifier that is identical to one already in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block (see the sketch below). Repeated data deletion can greatly reduce the storage requirement, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores these feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
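As an illustration of the identifier-based scheme described above, the following minimal Python sketch deduplicates a stream of data blocks. The function name and the choice of SHA-256 fingerprints are illustrative assumptions, not details from [75].

```python
import hashlib

def deduplicate(blocks):
    """Replace repeated data blocks by references to already-stored copies.

    Each block is identified by a hash fingerprint; a block whose fingerprint
    is already in the identification list is treated as redundant, and only a
    reference to the stored copy is kept.
    """
    store = {}       # fingerprint -> stored block (the "identification list")
    layout = []      # sequence of fingerprints describing the original stream
    for block in blocks:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:
            store[fp] = block
        layout.append(fp)
    return store, layout

data = [b"header", b"payload", b"payload", b"header"]
store, layout = deduplicate(data)
print(len(store), "unique blocks kept out of", len(layout))  # 2 unique blocks kept out of 4
```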

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to remain competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems emerge to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

In terms of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network sub-systems, which provide connections among one or more disc arrays and servers; and (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks for multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration.

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected in terms of satisfying customers' read and write requests. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable if the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system cannot simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of the data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. AP systems differ from CP systems in that they also ensure availability, but they only ensure eventual consistency rather than the strong consistency of the previous two types. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data error is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges in categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and customers query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform that can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface constituted by simple read and write operations. Dynamo achieves elasticity and availability through its data partitioning, data replication, and object versioning mechanisms. The Dynamo partition plan relies on consistent hashing [86], whose main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes, dividing the load over the main storage machines (a minimal consistent-hashing sketch is given at the end of this list of databases). Dynamo copies data to N sets of servers, where N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control over multiple versions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. Notably, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.

– Column-oriented databases: column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to achieve expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data among thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and are continually segmented into tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. Columns are grouped according to the prefixes of their keys, thereby forming column families; these column families are the basic units for access control. Timestamps are 64-bit integers that distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamp, so the latest version is always read first.

The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from individual rows, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

A BigTable deployment includes three main components: the master server, tablet servers, and a client library. BigTable allows only one master server to be active, which is responsible for distributing tablets to tablet servers, detecting added or removed tablet servers, and conducting load balancing. In addition, the master can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, i.e., files that have been deleted or disabled by specific BigTable instances. Every tablet server manages a set of tablets and is responsible for reads and writes of the tablets it has loaded. When tablets grow too big, they are split by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling machine failures, and monitoring machine status. The SSTable file format is used to store BigTable data internally; it provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensuring there is at most one active master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up tablet servers; 4) conducting error recovery in case of tablet server failures; 5) storing BigTable schema information; and 6) storing the access control lists.

– Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon's Dynamo and Google's BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may constitute clusters, called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code is not available under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disk. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partitioning and distribution are operated transparently and have room for client hashing or fixed keys.

HyperTable was developed in a manner similar to BigTable to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), and allows users to create, modify, and query the underlying tables.

Since column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrent control over multiple versions, while HBase and HyperTable focus on strong consistency through locks or log records.

– Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries over all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is executed with log files on the master nodes that record all the high-level operations conducted in the database. During replication, the slaves query the master for all the write operations since their last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, automatically balancing load and handling failover.


– SimpleDB: SimpleDB is a distributed database offered as a web service by Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different items described by sets of attribute name/value pairs. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be scaled as the data volume changes. SimpleDB allows users to query with SQL-like syntax. It is worth noting that SimpleDB ensures eventual consistency but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected on the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and their values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its revision identifier is updated. CouchDB achieves scalability through optimistic replication without a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on its replication mechanism. CouchDB supports MVCC with historical hash records.
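The partitioning idea shared by Dynamo-style key-value stores can be illustrated with a minimal consistent-hashing ring. The Python sketch below is a simplified illustration under our own assumptions (MD5 ring positions, 64 virtual nodes per server, three replicas per key); it is not the actual Dynamo, Voldemort, or Cassandra implementation.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Keys and nodes are hashed onto the same ring; a key is stored on the
    first node found clockwise, and the next distinct nodes hold replicas.
    Adding or removing a node only moves the keys of adjacent nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self.ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):        # virtual nodes smooth the load
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def nodes_for(self, key, replicas=3):
        start = bisect.bisect(self.ring, (self._hash(key), chr(0x10FFFF)))
        owners, seen = [], set()
        for step in range(len(self.ring)):
            _, node = self.ring[(start + step) % len(self.ring)]
            if node not in seen:
                seen.add(node)
                owners.append(node)
            if len(owners) == replicas:
                break
        return owners

ring = ConsistentHashRing(["server-a", "server-b", "server-c", "server-d"])
print(ring.nodes_for("user:42"))   # three distinct servers, order depends on hash positions
```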

Big data is generally stored on hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL systems and reduced the performance gap with relational databases. Therefore, these models have become the cornerstone of massive data analysis.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set (a word-count sketch of this flow is given after this list). MapReduce has the advantage that it hides the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication: the user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo!, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via the data channels, including files, TCP connections, and shared-memory FIFOs. During operation, the resources in a logical operation graph are automatically mapped to physical resources.

The operation of Dryad is coordinated by a central program called the job manager, which can be executed on a cluster or a workstation accessible through the network. A job manager consists of two parts: 1) application code used to build the job communication graph, and 2) program library code used to arrange the available resources. All data is transmitted directly between vertexes; therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any number of inputs and outputs, while MapReduce supports only one input set and one output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets using a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A with all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve its input data effectively. In Phase III, after the data has been delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire the data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed as a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge associated with a source vertex carries a user-defined value and the identifier of a target vertex. After the graph is built, the program conducts iterative calculations, called supersteps, between which global synchronization points are set, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm's logic. Every vertex may modify its own status and the status of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in an inactive state with no messages to transmit, the entire program execution is completed. The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
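The division of labor described in the MapReduce item above, where the user supplies only Map and Reduce while the framework handles grouping, can be illustrated with the classic word-count example. The single-process Python sketch below only simulates the map, shuffle, and reduce phases; names such as run_mapreduce are our own, and distribution, scheduling, and fault tolerance are omitted entirely.

```python
from collections import defaultdict

# Users supply only these two functions; the framework handles
# partitioning, shuffling, and fault tolerance.
def map_fn(_, line):
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process simulation of the map -> shuffle -> reduce flow."""
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):               # map phase
            intermediate[k].append(v)                 # shuffle: group by key
    results = []
    for k in sorted(intermediate):
        results.extend(reduce_fn(k, intermediate[k])) # reduce phase
    return results

docs = [(1, "big data needs big storage"), (2, "big data needs analysis")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('analysis', 1), ('big', 3), ('data', 2), ('needs', 2), ('storage', 1)]
```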

Inspired by the above programming models, other research efforts have also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and data-dependent flow control decision-making [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of the data. Data analysis plays a huge guiding role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

– Cluster analysis: a statistical method for grouping objects, specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data (a minimal k-means sketch is given at the end of this subsection).

– Factor analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, so that the few factors then reveal most of the information in the original data.



– Correlation analysis: an analytical method for determining the law of relations among observed phenomena, such as correlation, correlative dependence, and mutual restriction, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, also called a definitive dependence relationship; (ii) correlation, an undetermined or inexact dependence relation, in which the numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation around their mean values.

– Regression analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies the dependence relationships among variables that are hidden by randomness. Regression analysis can turn complex and undetermined correlations among variables into something simple and regular.

– A/B testing: also called bucket testing, a technique for determining how to improve target variables by comparing tested groups. Big data will require a large number of such tests to be executed and analyzed.

– Statistical analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide both descriptions of and inferences about big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
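As a concrete illustration of cluster analysis with one of the ICDM algorithms listed above, the following sketch runs k-means on a toy two-dimensional dataset. It assumes the scikit-learn library is available; the data values are invented for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset: each row is an object described by two features.
points = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
                   [8.0, 9.0], [8.2, 8.7], [7.9, 9.1]])

# k-means assigns each object to the cluster whose centroid is nearest,
# then recomputes the centroids until the assignment stabilises.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)
print(labels)                  # e.g. [0 0 0 1 1 1]: two well-separated clusters
print(model.cluster_centers_)  # one centroid per cluster
```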

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom filter: a Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of the data rather than the data itself, using a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy-compression storage of data. It has advantages such as high space efficiency and high query speed, but also has disadvantages, namely misrecognition (false positives) and the difficulty of deletion (a minimal sketch is given after this list).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, indexing has the disadvantage of the additional cost of storing index files, which must be maintained dynamically as data is updated.

– Trie: also called a trie tree, a variant of a hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency.

– Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
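The following minimal Python sketch illustrates the Bloom filter principle described above: hash values are stored in a bit array, membership tests are fast, false positives are possible, and items cannot be deleted. The class name, sizes, and hash construction are our own illustrative choices, not a reference implementation.

```python
import hashlib

class BloomFilter:
    """Bit-array Bloom filter storing hashes of items rather than the items."""

    def __init__(self, size=1024, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user-1001")
print(bf.might_contain("user-1001"))  # True
print(bf.might_contain("user-9999"))  # False (with high probability)
```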

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

                           MPI                            MapReduce                       Dryad
  Deployment               Computing nodes and data       Computing and data storage      Computing and data storage
                           storage arranged separately    arranged at the same node       arranged at the same node
                           (data is moved to the          (computing is moved close       (computing is moved close
                           computing nodes)               to the data)                    to the data)
  Resource management/     –                              Workqueue (Google),             Not clear
  scheduling                                              HOD (Yahoo!)
  Low-level programming    MPI API                        MapReduce API                   Dryad API
  High-level programming   –                              Pig, Hive, Jaql, ...            Scope, DryadLINQ
  Data storage             The local file system,         GFS (Google),                   NTFS,
                           NFS, ...                       HDFS (Hadoop), KFS,             Cosmos DFS
                                                          Amazon S3, ...
  Task partitioning        User manually partitions       Automatic                       Automatic
                           the tasks
  Communication            Messaging, remote              Files (local FS, DFS)           Files, TCP pipes,
                           memory access                                                  shared-memory FIFOs
  Fault tolerance          Checkpoint                     Task re-execution               Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since the data constantly changes, rapid data analysis is needed, and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis: for the case where the data scale surpasses the memory level but may be imported into the BI analysis environment. Currently, mainstream BI products provide data analysis plans that support scales beyond the TB level.

– Massive analysis: for the case where the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the kind of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of 798 professionals, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?", conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (and ranked first). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. The functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be regarded as a factory production line, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly add new nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of commercial applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. The analytical techniques used in such systems, prevailing in the 1990s, were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably larger capacity to support location-aware, people-oriented, and context-aware operations.

– Evolution of network applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and to building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of scientific applications: scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, such as astrophysics, oceanology, genomics, and environmental research. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operable analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observational data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to provide comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning, based on exact mathematical models and powerful algorithms, has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection needs in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].

6.2.2 Text data analysis

The most common format of stored information is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business-oriented potential than the analysis of structured data. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
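As a small illustration of how unstructured text can be turned into features for the mining algorithms mentioned above, the sketch below vectorizes a few toy documents with TF-IDF weighting and clusters them. It assumes the scikit-learn library is available; the documents and cluster count are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
    "stock markets fell after the announcement",
]

# Bag-of-words with TF-IDF weighting turns unstructured text into feature
# vectors on which standard mining algorithms (clustering here) can operate.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # the three shipment/delivery documents should share one cluster
```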

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. Research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. The information retrieval method mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based search.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawlers are another successful case of utilizing such models [127].
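The idea behind PageRank can be illustrated with a short power-iteration sketch on a toy link graph; the graph, damping factor, and iteration count are illustrative and do not reproduce the production algorithm of [125].

```python
# PageRank sketch: repeatedly set rank(p) = (1-d)/N + d * sum over pages q
# linking to p of rank(q)/outdegree(q), until the ranks stabilize.
graph = {              # page -> pages it links to (toy web graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
N, d = len(graph), 0.85
rank = {p: 1.0 / N for p in graph}

for _ in range(50):    # power iteration
    new_rank = {}
    for page in graph:
        incoming = sum(rank[q] / len(graph[q]) for q in graph if page in graph[q])
        new_rank[page] = (1 - d) / N + d * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # most "important" pages first
```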

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data is gaining increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
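As an illustration of how such usage data is typically prepared for mining, the following sketch groups hypothetical access-log records into user sessions using a 30-minute inactivity gap, a common preprocessing step before pattern discovery; the log entries and gap value are invented.

```python
# Web usage mining preprocessing sketch: split each user's access-log
# entries into sessions whenever two consecutive requests are more than
# 30 minutes apart. The log records are made up for illustration.
from datetime import datetime, timedelta
from collections import defaultdict

log = [  # (user, timestamp, url)
    ("u1", "2013-05-01 10:00", "/home"),
    ("u1", "2013-05-01 10:05", "/product/42"),
    ("u1", "2013-05-01 11:40", "/home"),
    ("u2", "2013-05-01 10:02", "/cart"),
]
GAP = timedelta(minutes=30)

sessions = defaultdict(list)                  # user -> list of sessions
for user, ts, url in sorted(log, key=lambda r: (r[0], r[1])):
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
    user_sessions = sessions[user]
    if not user_sessions or t - user_sessions[-1][-1][0] > GAP:
        user_sessions.append([])              # start a new session
    user_sessions[-1].append((t, url))

for user, s in sessions.items():
    print(user, [[url for _, url in sess] for sess in s])
```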

6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed; the goal of multimedia analysis is to extract useful knowledge and understand the semantics contained in such data. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information from it is confronted with the huge challenge of the semantic difference. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo!, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
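A toy sketch of the static, key-frame-based approach is given below: a frame is selected as a key frame whenever it differs strongly from the previously selected one. The small synthetic numpy arrays and the difference threshold are stand-ins for decoded video frames and a tuned parameter, not part of any cited system.

```python
# Static video summarization sketch: keep a frame as a key frame when it
# differs strongly from the last selected key frame. "Frames" here are
# small synthetic arrays standing in for decoded video frames.
import numpy as np

rng = np.random.default_rng(0)
scene_a, scene_b = rng.random((8, 8)), rng.random((8, 8))
frames = [scene_a + 0.01 * rng.random((8, 8)) for _ in range(5)] + \
         [scene_b + 0.01 * rng.random((8, 8)) for _ in range(5)]

def key_frames(frames, threshold=0.2):
    keys = [0]
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i] - frames[keys[-1]]))  # mean pixel change
        if diff > threshold:
            keys.append(i)
    return keys

print(key_frames(frames))   # e.g. [0, 5]: one key frame per synthetic scene
```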

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation are to utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos; the retrieval result is then refined through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests, and recommend other contents with similar features to them. These methods largely rely on content similarity measurement, but most of them suffer from limited analysis capability and over-specification. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Recently, hybrid methods have been introduced, which integrate the advantages of the two types of methods to improve recommendation quality [133].
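A minimal sketch of the collaborative-filtering idea is given below: users are compared by cosine similarity over a toy rating matrix, and unrated items favored by the most similar user are recommended. The matrix, item names, and single-neighbor choice are illustrative, not the systems of [132] or [133].

```python
# User-based collaborative filtering sketch: recommend unrated items that
# the most similar user (by cosine similarity) rated highly. Toy data.
import numpy as np

items = ["movie_a", "movie_b", "movie_c", "movie_d"]
ratings = np.array([        # rows: users, columns: items, 0 = unrated
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def recommend(user, k=1):
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(ratings[user])
    sims = ratings @ ratings[user] / norms        # cosine similarity to all users
    sims[user] = -1                               # ignore the user itself
    neighbor = int(np.argmax(sims))               # most similar user
    unrated = np.where(ratings[user] == 0)[0]
    ranked = sorted(unrated, key=lambda i: -ratings[neighbor, i])
    return [items[i] for i in ranked[:k]]

print(recommend(0))   # user 0's nearest neighbor is user 1, who rated movie_c
```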

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text description related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graphic structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph, in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
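As a minimal illustration of feature-based scoring for link prediction, the sketch below ranks unconnected vertex pairs by their number of common neighbors, one of the simplest structural features used as a baseline; the friendship graph is a toy example, not taken from [139–141].

```python
# Link prediction sketch: score each unconnected pair of users by the
# number of neighbors they share -- a simple structural baseline feature.
from itertools import combinations

friends = {                      # toy undirected friendship graph
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave":  {"bob"},
}

def predicted_links(graph):
    scores = []
    for u, v in combinations(graph, 2):
        if v not in graph[u]:                    # only score unconnected pairs
            scores.append(((u, v), len(graph[u] & graph[v])))
    return sorted(scores, key=lambda s: -s[1])

print(predicted_links(friends))
# both ('alice', 'dave') and ('carol', 'dave') share one neighbor (bob)
```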

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and so does Twitter with trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and such communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for the real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.

In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for the real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, on marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. On sales planning, after comparison of massive data, enterprises can optimize their commodity prices. On operation, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. On supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been developing rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and more importantly, so are the ages, genders, addresses, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3% bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT-based big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project based on the cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year, due to timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in the human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented by applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old; finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
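A toy sketch of aspect 1), detecting a sharp growth or drop in topic volume, is given below; the daily mention counts, window length, and thresholds are invented and are unrelated to the actual Global Pulse analysis.

```python
# Sketch of abnormal-event detection on topic volume: flag a day whose
# number of topic mentions grows or drops sharply relative to the average
# of the preceding days. All counts are invented.
daily_counts = [120, 130, 118, 125, 122, 127, 119, 410, 123, 60]
WINDOW = 5                        # baseline length in days

for day in range(WINDOW, len(daily_counts)):
    baseline = sum(daily_counts[day - WINDOW:day]) / WINDOW
    ratio = daily_counts[day] / baseline
    if ratio > 2.0 or ratio < 0.5:
        print(f"day {day}: {daily_counts[day]} mentions, "
              f"{ratio:.1f}x the recent average")
```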

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20%.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows: a user requests services and resources related to a specified location; then the mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures); finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecasted that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
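The assignment step of this workflow can be sketched as follows: a location-tagged sensing task is given to the nearest willing mobile user, with distance computed by the haversine formula. The worker coordinates and task location below are hypothetical.

```python
# Spatial crowdsourcing sketch: assign a location-tagged task to the
# nearest participating mobile user (great-circle distance in km).
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

workers = {                       # user -> (latitude, longitude), hypothetical
    "user_a": (30.52, 114.31),
    "user_b": (30.60, 114.25),
    "user_c": (30.45, 114.40),
}
task_location = (30.55, 114.30)   # where the requester needs a photo taken

assignee = min(workers, key=lambda w: haversine_km(workers[w], task_location))
print("assign task to", assignee)
```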

6.3.6 Smart grid

Smart grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for the optimized generation, supply, and consumption of electric energy. Smart-grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the smart grid, the regions with excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit and demonstrates the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a toy sketch of such time-of-use pricing is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
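The time-of-use pricing mentioned above can be sketched as follows: 15-minute smart-meter readings are aggregated per hour, above-average hours are treated as peak hours, and each hour is billed at the corresponding rate. All readings and rates are invented for illustration and are not TXU Energy's actual scheme.

```python
# Toy time-of-use pricing from smart-meter data: aggregate 15-minute
# readings per hour, mark above-average hours as "peak", and bill each
# hour at the corresponding rate. All numbers are invented.
from collections import defaultdict

readings = [  # (hour of day, kWh drawn in one 15-minute interval)
    (8, 0.4), (8, 0.5), (8, 0.6), (8, 0.5),
    (13, 0.2), (13, 0.3), (13, 0.2), (13, 0.3),
    (19, 0.9), (19, 1.0), (19, 1.1), (19, 0.9),
]
RATES = {"peak": 0.30, "off_peak": 0.12}        # currency units per kWh

hourly = defaultdict(float)
for hour, kwh in readings:
    hourly[hour] += kwh

threshold = sum(hourly.values()) / len(hourly)   # above-average hours = peak
bill = sum(kwh * RATES["peak" if kwh > threshold else "off_peak"]
           for kwh in hourly.values())
print(dict(hourly), "bill:", round(bill, 2))
```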

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined from the combined data; (iii) data exhaust: wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, acquired data from the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources with poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We could not predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a user-friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic is more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/big-data-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note 6, February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to LINQ. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewer's conjecture and the feasibility of consistent available partition-tolerant web services ACM SIGACT News 33(2)51–59

82 McKusick MK Quinlan S (2009) GFS evolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase the definitive guide O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB the definitive guide O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services S3 EC2 SQS FPS and SimpleDB O'Reilly Media Inc

95 Anderson JC Lehnardt J Slater N (2010) CouchDB the definitive guide O'Reilly Media Inc

96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in mapreduce In Proceedings of the 2010 ACM SIGMOD international conference on management of data ACM pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)

116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iplant collaborative cyberinfrastructure for plant biology Front Plant Sci 34(2)1–16 doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update 2012–2017 http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



– Zero-copy packet capture technology. The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) to transmit network datagrams directly to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from the system kernel to user space and reduce the number of system calls. (A minimal user-space sketch of this idea is given after this list.)

– Mobile equipment. At present, mobile devices are ever more widely used. As mobile device functions become increasingly powerful, they feature more complex and more varied means of data acquisition, as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body-language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy": it may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Google's Android and Microsoft's Windows Phone can also collect information in a similar manner.
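
To make the zero-copy discussion above concrete, the following is a minimal user-space sketch in Python of the copy-avoidance part of the idea only: packets are received into a single pre-allocated buffer and handled through views rather than per-packet copies. It does not implement true zero-copy (which relies on kernel mechanisms such as DMA and memory-mapped ring buffers, e.g., Linux PACKET_MMAP); the interface name, packet count, and function names are illustrative assumptions.

# User-space illustration of the copy-avoidance idea behind zero-copy capture:
# packets are received directly into one pre-allocated buffer, with no
# per-packet allocation or intermediate copy in user space. True zero-copy
# additionally maps a kernel ring buffer into the process, which is omitted here.
import socket

BUF_SIZE = 65535

def process(frame: memoryview):
    # placeholder for the detection program's analysis logic
    print(len(frame), "bytes captured")

def capture(iface: str = "eth0", count: int = 10):
    # AF_PACKET/SOCK_RAW delivers whole link-layer frames (Linux only, needs root);
    # 0x0003 is ETH_P_ALL, i.e., capture every protocol.
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
    sock.bind((iface, 0))
    buf = bytearray(BUF_SIZE)           # pre-allocated once, reused for every packet
    view = memoryview(buf)              # slicing a memoryview does not copy data
    for _ in range(count):
        nbytes = sock.recv_into(view)   # the kernel writes into our buffer directly
        process(view[:nbytes])          # pass a view, not a copy
    sock.close()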

In addition to the aforementioned three data acquisition methods for the main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources, and collection methods recording through other auxiliary tools.

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or to facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions.

– Inter-DCN transmissions. Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single-channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology: it segments a high-speed data flow into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions. Intra-DCN transmissions are the data communication flows within data centers. They depend on the communication mechanism within the data center (i.e., on the physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TOR), and such top-of-rack switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer structure, and such a layer is constituted by 10 Gbps or 100 Gbps core switches to connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches, using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
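
The two-layer structure just described can be sized with simple arithmetic. Below is a back-of-the-envelope sketch in Python; the 1 Gbps/10 Gbps link rates follow the text, while all port counts, switch counts, and the function name are illustrative assumptions rather than figures from any particular deployment.

# Rough sizing of a two-layer ToR/aggregation data center network: how many
# hosts it supports and how oversubscribed the ToR uplinks are.
def two_layer_capacity(tor_down_ports=48, tor_down_gbps=1,
                       tor_up_ports=4, tor_up_gbps=10,
                       agg_switches=4, agg_ports=96):
    hosts_per_rack = tor_down_ports
    # assume each ToR spreads one uplink to each aggregation switch it uses
    racks = (agg_switches * agg_ports) // tor_up_ports
    hosts = racks * hosts_per_rack
    down_bw = tor_down_ports * tor_down_gbps       # capacity towards the servers
    up_bw = tor_up_ports * tor_up_gbps             # capacity towards the aggregation layer
    oversubscription = down_bw / up_bw
    return hosts, oversubscription

hosts, ratio = two_layer_capacity()
print(f"{hosts} hosts, {ratio:.2f}:1 oversubscription at the ToR layer")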

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting source systems, and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure; it is the most complex procedure among the three, which includes operations such as transformation, copying, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; on the contrary, it includes information or metadata related to the actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67]. (A toy sketch combining integration with the cleaning and redundancy elimination steps below is given after this list.)

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance to keep data consistent; it is widely applied in many fields, such as banking, insurance, the retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, which includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data, so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetition or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects; for example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by exploiting the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

For generalized data transmission or storage, repeated data deletion (de-duplication) is a special data compression technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block is deemed redundant and is replaced by the corresponding stored data block; a toy version of this hashing step appears in the sketch after this list. Repeated data deletion can greatly reduce the storage requirement, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such an operation plays an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
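
The following is a toy Python sketch of the three pre-processing steps discussed above, in the order integration (merging two sources into one schema), cleaning (dropping records that violate simple constraints), and redundancy elimination (hash-based detection of repeated records). The two sources, field names, and validity rules are invented purely for illustration; real pipelines would of course be far more elaborate.

# Toy pre-processing pipeline: integrate, clean, then de-duplicate.
import hashlib

source_a = [{"id": "1", "temp_c": "21.5"}, {"id": "2", "temp_c": "bad"}]
source_b = [{"sensor": 1, "temperature": 21.5}, {"sensor": 3, "temperature": 19.0}]

def _to_float(value):
    try:
        return float(value)
    except ValueError:
        return None                        # flagged as an error, removed by cleaning

def extract_transform():
    # Integration: transform both sources into one (sensor_id, temperature) schema.
    for rec in source_a:
        yield {"sensor_id": int(rec["id"]), "temperature": _to_float(rec["temp_c"])}
    for rec in source_b:
        yield {"sensor_id": rec["sensor"], "temperature": float(rec["temperature"])}

def clean(records):
    # Cleaning: keep only records that satisfy completeness and rationality checks.
    for rec in records:
        if rec["temperature"] is not None and -50 <= rec["temperature"] <= 60:
            yield rec

def deduplicate(records):
    # Redundancy elimination: hash each record and keep an identification list
    # of fingerprints already seen; identical records are dropped.
    seen = set()
    for rec in records:
        fingerprint = hashlib.sha1(repr(sorted(rec.items())).encode()).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield rec

loaded = list(deduplicate(clean(extract_transform())))    # "load" into the target store
print(loaded)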

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue big storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network sub-systems, which provide connections among one or more disk arrays and servers; (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks for multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers, to ensure availability in case of server failures. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates in multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system were not so seriously affected as to fail to satisfy customers' requests for reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three kinds of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as the traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that AP systems also ensure availability, while only ensuring eventual consistency rather than the strong consistency of the previous two kinds of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Service (SNS) systems, there are many concurrent visits to the data, but a certain amount of data error is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
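
One common way to picture the CP/AP distinction is through replica quorums. The following is a toy Python sketch, not the design of any real system: with N replicas, a write waits for W acknowledgements and a read consults R replicas. If R + W > N, every read overlaps the latest write (CP-flavoured behaviour, at the cost of rejecting writes when too few replicas are reachable); smaller R and W favour availability and yield only eventual consistency. Class, method, and parameter names are all invented for illustration.

# Toy replicated key-value store illustrating quorum reads/writes.
class Replica:
    def __init__(self):
        self.store = {}                      # key -> (version, value)

class QuorumStore:
    def __init__(self, n=3, r=2, w=2):
        self.replicas = [Replica() for _ in range(n)]
        self.r, self.w = r, w
        self.version = 0

    def put(self, key, value, reachable=None):
        # Write to the first W reachable replicas; synchronising the remaining
        # replicas (anti-entropy) is omitted in this sketch.
        targets = reachable if reachable is not None else self.replicas
        if len(targets) < self.w:
            raise RuntimeError("not enough replicas: write rejected (CP-style behaviour)")
        self.version += 1
        for rep in targets[: self.w]:
            rep.store[key] = (self.version, value)

    def get(self, key):
        # Read R replicas and return the value with the highest version seen.
        answers = [rep.store.get(key, (0, None)) for rep in self.replicas[: self.r]]
        return max(answers)[1]

store = QuorumStore(n=3, r=2, w=2)           # R + W > N: reads always see the latest write
store.put("x", "v1")
print(store.get("x"))                        # -> v1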

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at the upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source code of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store the large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges in categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support of large-volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main kinds of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and the modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data replication, and object versioning mechanisms. The Dynamo partition plan relies on Consistent Hashing [86], whose main advantage is that the arrival or departure of a node only affects its directly adjacent nodes on the hash ring and does not affect other nodes, while dividing the load among the main storage machines (a toy sketch of such a ring is given after this overview of NoSQL databases). Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the updating and any other operations, the updating operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, including Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.

ndash Column-oriented Database The column-orienteddatabases store and process data according to columnsother than rows Both columns and rows are segmentedin multiple nodes to realize expandability The column-oriented databases are mainly inspired by GooglersquosBigTable In this Section we first discuss BigTable andthen introduce several derivative tools

ndash BigTable BigTable is a distributed structureddata storage system which is designed to pro-cess the large-scale (PB class) data amongthousands commercial servers [87] The basicdata structure of Bigtable is a multi-dimensionsequenced mapping with sparse distributedand persistent storage Indexes of mappingare row key column key and timestampsand every value in mapping is an unana-lyzed byte array Each row key in BigTable

is a 64KB character string By lexicograph-ical order rows are stored and continuallysegmented into Tablets (ie units of distribu-tion) for load balance Thus reading a shortrow of data can be highly effective since itonly involves communication with a small por-tion of machines The columns are groupedaccording to the prefixes of keys and thusforming column families These column fami-lies are the basic units for access control Thetimestamps are 64-bit integers to distinguishdifferent editions of cell values Clients mayflexibly determine the number of cell editionsstored These editions are sequenced in thedescending order of timestamps so the latestedition will always be read

The BigTable API features the creation anddeletion of Tablets and column families as wellas modification of metadata of clusters tablesand column families Client applications mayinsert or delete values of BigTable query val-ues from columns or browse sub-datasets in atable Bigtable also supports some other char-acteristics such as transaction processing in asingle row Users may utilize such features toconduct more complex data processing

Every procedure executed by BigTableincludes three main components Masterserver Tablet server and client libraryBigtable only allows one set of Master serverbe distributed to be responsible for distribut-ing tablets for Tablet server detecting addedor removed Tablet servers and conductingload balance In addition it can also mod-ify BigTable schema eg creating tables andcolumn families and collecting garbage savedin GFS as well as deleted or disabled filesand using them in specific BigTable instancesEvery tablet server manages a Tablet set andis responsible for the reading and writing of aloaded Tablet When Tablets are too big theywill be segmented by the server The applica-tion client library is used to communicate withBigTable instances

BigTable is based on many fundamentalcomponents of Google including GFS [25]cluster management system SSTable file for-mat and Chubby [88] GFS is use to store dataand log files The cluster management systemis responsible for task scheduling resourcessharing processing of machine failures andmonitoring of machine statuses SSTable fileformat is used to store BigTable data internally


and it provides mapping between persistentsequenced and unchangeable keys and valuesas any byte strings BigTable utilizes Chubbyfor the following tasks in server 1) ensurethere is at most one active Master copy atany time 2) store the bootstrap location ofBigTable data 3) look up Tablet server 4) con-duct error recovery in case of Table server fail-ures 5) store BigTable schema information 6)store the access control table

– Cassandra: Cassandra is a distributed storage system to manage the huge amounts of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key with arbitrary length. No matter the number of columns to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family at runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

ndash Derivative tools of BigTable since theBigTable code cannot be obtained throughthe open source license some open sourceprojects compete to implement the BigTableconcept to develop similar systems such asHBase and Hypertable

HBase is a BigTable cloned version pro-grammed with Java and is a part of Hadoop ofApachersquos MapReduce framework [90] HBasereplaces GFS with HDFS It writes updatedcontents into RAM and regularly writes theminto files on disks The row operations areatomic operations equipped with row-levellocking and transaction processing which is

optional for large scale Partition and distribu-tion are transparently operated and have spacefor client hash or fixed key

HyperTable was developed similar toBigTable to obtain a set of high-performanceexpandable distributed storage and process-ing systems for structured and unstructureddata [91] HyperTable relies on distributedfile systems eg HDFS and distributed lockmanager Data representation processing andpartition mechanism are similar to that inBigTable HyperTable has its own query lan-guage called HyperTable query language(HQL) and allows users to create modify andquery underlying tables

Since the column-oriented storage databases mainlyemulate BigTable their designs are all similar exceptfor the concurrency mechanism and several other fea-tures For example Cassandra emphasizes weak consis-tency of concurrent control of multiple editions whileHBase and HyperTable focus on strong consistencythrough locks or log records

ndash Document Database Compared with key-value stor-age document storage can support more complex dataforms Since documents do not follow strict modesthere is no need to conduct mode migration In additionkey-value pairs can still be saved We will examine threeimportant representatives of document storage systemsie MongoDB SimpleDB and CouchDB

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. A query in MongoDB is expressed with a syntax similar to JSON (a small example is given after this overview of NoSQL databases). A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The replication operation in MongoDB can be executed with log files on the main nodes that support all the high-level operations conducted in the database. During replication, the slaves query all the writing operations since the last synchronization from the master and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, automatically balancing load and handling failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers, in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency, but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

ndash CouchDB Apache CouchDB is a document-oriented database written in Erlang [95] Datain CouchDB is organized into documents con-sisting of fields named by keysnames andvalues which are stored and accessed as JSONobjects Every document is provided witha unique identifier CouchDB allows accessto database documents through the RESTfulHTTP API If a document needs to be modi-fied the client must download the entire doc-ument to modify it and then send it back tothe database After a document is rewrittenonce the identifier will be updated CouchDButilizes the optimal copying to obtain scalabil-ity without a sharing mechanism Since var-ious CouchDBs may be executed along withother transactions simultaneously any kinds ofReplication Topology can be built The con-sistency of CouchDB relies on the copyingmechanism CouchDB supports MVCC withhistorical Hash records
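
The key-value partitioning scheme mentioned in the Dynamo discussion above, consistent hashing, can be illustrated with a few lines of Python. This is a minimal sketch of the idea only (keys and nodes are hashed onto the same ring, and a key is stored on the first node clockwise from it, so adding or removing a node only moves the keys of its ring neighbours); the node names and the number of virtual nodes are illustrative assumptions, and real systems add replication, failure detection, and rebalancing on top.

# Minimal consistent-hashing ring with virtual nodes.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # each physical node is placed on the ring at many pseudo-random points
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # first ring position clockwise from the key's hash (wrapping around)
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.node_for("user:42"))    # the node responsible for this key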
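
The JSON-style query interface mentioned in the MongoDB item above can likewise be shown with a short sketch. The example below uses the pymongo driver and assumes a MongoDB server is reachable at the default local port; the database, collection, and field names are invented for illustration.

# JSON-style document queries with pymongo (assumes a local MongoDB instance).
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
readings = client["iot"]["readings"]

readings.create_index([("sensor_id", ASCENDING)])          # index a queryable field
readings.insert_one({"sensor_id": 7, "temp": 21.5, "tags": ["roof", "east"]})

# Queries are plain dictionaries, and may touch embedded arrays and use operators.
for doc in readings.find({"sensor_id": 7, "tags": "roof", "temp": {"$gt": 20}}):
    print(doc["_id"], doc["temp"])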

Big data are generally stored in hundreds and even thou-sands of commercial servers Thus the traditional parallelmodels such as Message Passing Interface (MPI) and OpenMulti-Processing (OpenMP) may not be adequate to sup-port such large-scale parallel programs Recently someproposed parallel programming models effectively improvethe performance of NoSQL and reduce the performancegap to relational databases Therefore these models havebecome the cornerstone for the analysis of massive data

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application (a word-count sketch in this style is given at the end of this list). The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logical operation graph are automatically mapped to physical resources.

The operational structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or workstations through the network. A job manager consists of two parts: 1) application code, which is used to build the job communication graph, and 2) program library code, which is used to arrange the available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output data sets, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad, providing an SQL-like declarative language integrated with the Dryad execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements of Set A with all elements of Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve its input data effectively. In Phase III, after the data is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire the data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-scale graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge, associated with a source vertex, is constituted by a user-defined value and the identifier of its target vertex. After the graph is built, the program conducts iterative calculations, called supersteps, separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function that expresses the logic of a given algorithm. Every vertex may modify its own status and the status of its outgoing edges, receive messages sent to it in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in an inactive status and no messages remain in transit, the entire program execution is complete (see the superstep sketch after this list of models).

The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
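As a concrete illustration of the Map/Reduce contract described above, the following is a minimal, single-machine word-count sketch in Python; this is an illustrative simplification, since real frameworks such as Hadoop distribute the two functions across a cluster and handle the shuffle, scheduling, and fault tolerance automatically.

from collections import defaultdict

# Map: turn an input record (here, a line of text) into intermediate key-value pairs.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce: compress all values that share a key into a smaller result.
def reduce_fn(word, counts):
    return (word, sum(counts))

def run_mapreduce(lines):
    intermediate = defaultdict(list)
    for line in lines:                       # map phase
        for key, value in map_fn(line):
            intermediate[key].append(value)
    # the shuffle is implicit in the grouping above; the reduce phase follows
    return [reduce_fn(k, v) for k, v in intermediate.items()]

print(run_mapreduce(["big data big value", "data analysis"]))
# [('big', 2), ('data', 2), ('value', 1), ('analysis', 1)]

Similarly, a Pregel-style superstep loop can be sketched as below; this toy example propagates the maximum vertex value through an assumed small graph and halts when no vertex changes. It is a simplification of the bulk-synchronous model, not Google's implementation.

def pregel_max(values, out_edges):
    # values: vertex id -> current value; out_edges: vertex id -> list of target vertex ids
    active = set(values)
    while active:
        # every active vertex sends its value along its outgoing edges (messages for the next superstep)
        inbox = {v: [] for v in values}
        for v in active:
            for target in out_edges.get(v, []):
                inbox[target].append(values[v])
        # global synchronization point: all vertexes then process received messages in parallel
        active = set()
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                active.add(v)      # a vertex stays active only if its value changed
    return values

print(pregel_max({1: 3, 2: 6, 3: 2}, {1: [2], 2: [3], 3: [1]}))
# {1: 6, 2: 6, 3: 6}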

Inspired by the above programming models, other research efforts have also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and for big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: cluster analysis is a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while objects in different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

– Factor Analysis: factor analysis is basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor; the few factors are then used to reveal most of the information of the original data.



– Correlation Analysis: correlation analysis is an analytical method for determining the law of relations among observed phenomena, such as correlation, correlative dependence, and mutual restriction, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, i.e., undetermined or inexact dependence relations, in which the numerical value of one variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: regression analysis is a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing, it is a technology for determining how to improve target variables by comparing tested groups. Big data requires a large number of such tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide both description and inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research (a minimal k-means sketch is given after this list).
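To illustrate one of the algorithms named above, the following is a minimal k-means clustering sketch in pure Python over two-dimensional points; it is an illustrative simplification with a fixed iteration count, whereas production systems would use optimized libraries and distributed implementations.

import random

def kmeans(points, k, iterations=20):
    # points: list of (x, y) tuples; k: number of clusters
    centers = random.sample(points, k)          # initialize centers from the data
    for _ in range(iterations):
        # assignment step: attach every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
            clusters[idx].append(p)
        # update step: move each center to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return centers, clusters

data = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.2)]
centers, clusters = kmeans(data, k=2)
print(centers)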

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are shown as follows.

– Bloom Filter: a Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of the data, rather than the data itself, in a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression of the data. It has such advantages as high space efficiency and high query speed, but it may misrecognize elements (false positives) and does not support deletion (a short sketch is given after this list).

– Hashing: hashing is a method that transforms data into shorter, fixed-length numerical or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically as data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons among character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
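As a concrete illustration of the Bloom filter idea above, the following is a minimal sketch in Python; the bit-array size, the number of hash functions, and the use of salted SHA-1 digests are illustrative assumptions rather than part of the original survey.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size              # the bit array (a lossy, compressed index)

    def _positions(self, item):
        # derive several hash positions by salting a cryptographic digest
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("hadoop")
print(bf.might_contain("hadoop"), bf.might_contain("spark"))   # True, very likely False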

Although the parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

                        MPI                               MapReduce                          Dryad
Deployment              Computing node and data           Computing and data storage         Computing and data storage
                        storage arranged separately       arranged at the same node          arranged at the same node
                        (data moved to computing node)    (computing close to data)          (computing close to data)
Resource management/    -                                 Workqueue (Google), HOD (Yahoo)    Not clear
scheduling
Low-level programming   MPI API                           MapReduce API                      Dryad API
High-level programming  -                                 Pig, Hive, Jaql, ...               Scope, DryadLINQ
Data storage            The local file system, NFS, ...   GFS (Google), HDFS (Hadoop),       NTFS, KFS, Cosmos DFS
                                                          Amazon S3, ...
Task partitioning       User manually partitions tasks    Automatic                          Automatic
Communication           Messaging, remote memory access   Files (local FS, DFS)              Files, TCP pipes, shared-memory FIFOs
Fault tolerance         Checkpoint                        Task re-execution                  Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since the data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The currently mainstream BI products provide data analysis plans that support data beyond the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages, according to a survey of 798 professionals, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?", conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, or Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and plotting. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year," R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (which ranked first in 2012). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and, based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME; developers can effortlessly add various nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching the plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127].
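As an illustration of such link-structure models, the following is a minimal PageRank power-iteration sketch in Python over a small assumed link graph; the damping factor of 0.85 and the toy graph are illustrative assumptions, not part of the original survey.

def pagerank(links, damping=0.85, iterations=50):
    # links: page -> list of pages it links to
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share        # each page passes rank along its out-links
            else:
                for q in pages:                 # dangling page: spread its rank evenly
                    new_rank[q] += damping * rank[p] / len(pages)
        rank = new_rank
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))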

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data is gaining increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; analysis of such data aims to extract useful knowledge and understand the semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels that describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the results of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and assign videos to scheduled categories, so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then refined with relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of their interests and recommend other contents with similar features. These methods rely largely on content similarity measurement, but most of them suffer from limited analysis and over-specification. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
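To make the collaborative-filtering idea above concrete, the following is a minimal user-based collaborative filtering sketch in Python, using cosine similarity over a tiny assumed rating matrix; the user names, items, and ratings are purely illustrative assumptions.

from math import sqrt

ratings = {                                    # user -> {item: rating}, toy data
    "u1": {"video_a": 5, "video_b": 3, "video_c": 4},
    "u2": {"video_a": 4, "video_b": 3},
    "u3": {"video_b": 2, "video_c": 5},
}

def cosine(r1, r2):
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    num = sum(r1[i] * r2[i] for i in common)
    den = sqrt(sum(v * v for v in r1.values())) * sqrt(sum(v * v for v in r2.values()))
    return num / den

def recommend(user, k=1):
    # score unseen items by the similarity-weighted ratings of the other users
    scores = {}
    for other, r in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], r)
        for item, rating in r.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u2"))   # likely ['video_c']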

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors propose a new algorithm for multimedia event detection using only a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges


and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph, in which edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
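As a small illustration of feature-based link prediction, the following Python sketch scores candidate links by the common-neighbors count on an assumed toy graph; real systems would feed such scores, among many other features, to a binary classifier.

from itertools import combinations

# toy undirected SNS graph: user -> set of friends (assumed for illustration)
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def common_neighbors(u, v):
    # a simple link-prediction feature: how many friends u and v share
    return len(graph[u] & graph[v])

candidates = [(u, v) for u, v in combinations(graph, 2) if v not in graph[u]]
scores = sorted(((common_neighbors(u, v), u, v) for u, v in candidates), reverse=True)
for score, u, v in scores:
    print(f"predicted link ({u}, {v}) with {score} common neighbors")
# the pairs (a, d) and (c, d) each share one common neighbor (b)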

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on target functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose a more effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144-146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis

is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which frequently and quickly vary and are updated. The existing research on social media analysis is still in its infancy. Considering that SNS contains massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android platform had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just starting, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information for unlocking the security system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with the ages, genders, addresses, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

An online SNS is a social structure constituted by social individuals and connections among individuals, based on an


information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions, including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011, to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth

Fig. 5 Enabling technologies for online social network-oriented big data


or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogues on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change in food price inflation from the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for

effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment, in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. With such plans, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information across individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

Fig. 6 The correlation between Tweets about rice prices and food price inflation


imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications via its software development kit (SDK) and open interfaces.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need to intentionally deploy sensing modules or employ professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operational framework of spatial crowdsourcing is as follows. A user may request services and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and their increasingly powerful functions, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.

6.3.6 Smart grid

Smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges on exploiting big data.

– Grid planning. By analyzing data in the smart grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may then be conducted.

– Interaction between power generation and power consumption. An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has deployed smart electric meters with great success, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. The labor cost of meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to the peak and low periods of power consumption.


TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users; a minimal pricing sketch is given after this list.

– The access of intermittent renewable energy. At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
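As a concrete illustration of the second point above, the sketch below derives a simple time-sharing price level from 15-minute smart-meter readings. It is a hedged toy example: the thresholds, multipliers, and the assumption that price tracks instantaneous load are invented for illustration and are not TXU Energy's actual scheme.

    from statistics import mean

    def time_of_use_price(readings_kwh, base_price=0.10, peak_mult=1.5, offpeak_mult=0.7):
        """Map each 15-minute interval to a price level based on relative load."""
        avg = mean(readings_kwh)
        prices = []
        for load in readings_kwh:
            if load > 1.2 * avg:          # peak period -> higher price to flatten demand
                prices.append(base_price * peak_mult)
            elif load < 0.8 * avg:        # low period -> discount to shift consumption
                prices.append(base_price * offpeak_mult)
            else:
                prices.append(base_price)
        return prices

    meter = [0.4, 0.5, 1.8, 2.1, 0.3, 0.6]   # kWh per 15-minute interval (hypothetical)
    print(time_of_use_price(meter))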

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data. There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data. An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many big data solutions claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated once a system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions before and after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes. This includes the memory mode, data flow mode, PRAM mode, and MR (MapReduce) mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has become a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon; a toy illustration of the MR mode is sketched below.
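To make the last point more tangible, the following single-process Python sketch mimics the data flow of the MR (MapReduce) mode: the analyst supplies only map and reduce functions, and the grouping ("shuffle") in between is handled by the framework. This is an illustration of the programming model only, not a distributed implementation.

    from collections import defaultdict
    from itertools import chain

    def map_phase(record):                 # emit (key, value) pairs from one record
        return [(word, 1) for word in record.split()]

    def reduce_phase(key, values):         # aggregate all values observed for one key
        return key, sum(values)

    records = ["big data survey", "big data storage", "data center"]
    shuffled = defaultdict(list)
    for k, v in chain.from_iterable(map(map_phase, records)):
        shuffled[k].append(v)              # the shuffle step groups values by key
    print(dict(reduce_phase(k, vs) for k, vs in shuffled.items()))
    # {'big': 2, 'data': 3, 'survey': 1, 'storage': 1, 'center': 1}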

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data. Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data; a sliding-window sketch of one such online model is given after this list.

– Processing of big data. As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: as the scale of data increases, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust, which refers to the wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
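For the real-time item above, one simple way to picture an online computing model with an explicit data life cycle is a sliding time window, in which a datum stops contributing once it expires. The sketch below is only an assumed illustration (the window length and the mean statistic are arbitrary choices), not a model proposed by the survey.

    from collections import deque

    class SlidingWindowMean:
        def __init__(self, window_seconds: float):
            self.window = window_seconds
            self.items = deque()            # (timestamp, value) pairs still "alive"
            self.total = 0.0

        def add(self, t: float, value: float) -> float:
            self.items.append((t, value))
            self.total += value
            while self.items and t - self.items[0][0] > self.window:
                _, old = self.items.popleft()    # expired data no longer counts
                self.total -= old
            return self.total / len(self.items)  # current online estimate

    w = SlidingWindowMean(window_seconds=60)
    for t, v in [(0, 10.0), (30, 14.0), (90, 2.0)]:
        print(t, w.add(t, v))               # 10.0, 12.0, then 8.0 after the first value expires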

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved:

– Big data management. The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being devoted to big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data. As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets.

Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy. Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is currently deemed the big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality.


Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism. Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and secure communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security. Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system, as sketched below. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
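A minimal sketch of this kind of log analysis is given below. The log format, the FAILED_LOGIN marker, and the fixed threshold are all invented for illustration; a real system would analyze far larger logs and use more principled anomaly detection.

    from collections import Counter

    log_lines = [
        "10.0.0.5 FAILED_LOGIN", "10.0.0.5 FAILED_LOGIN", "10.0.0.5 FAILED_LOGIN",
        "10.0.0.7 FAILED_LOGIN", "10.0.0.9 OK", "10.0.0.5 FAILED_LOGIN",
    ]

    # count failed logins per source host within the analysis window
    failures = Counter(line.split()[0] for line in log_lines if "FAILED_LOGIN" in line)

    threshold = 3                                   # crude, hand-picked alert threshold
    suspects = [host for host, n in failures.items() if n > threshold]
    print(suspects)                                 # ['10.0.0.5'] -> worth a closer look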

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening.

We cannot predict the future, but we may take precautions against possible events in the future.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been exploring better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis; a small plotting sketch appears after this list. New presentation forms will occur in the future. For example, Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it has been observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would rather accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".
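Returning to the visualization item above, the short sketch below produces the three chart types mentioned there (histogram, pie chart, and regression curve) from synthetic data using matplotlib and numpy; it is only a generic plotting example, not tied to any system discussed in the survey.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + rng.normal(0, 2, size=x.size)        # noisy synthetic analysis results

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].hist(y, bins=10)                           # histogram of the results
    axes[0].set_title("Histogram")

    axes[1].pie([45, 30, 25], labels=["A", "B", "C"])  # share of three categories
    axes[1].set_title("Pie chart")

    slope, intercept = np.polyfit(x, y, 1)             # simple linear regression curve
    axes[2].scatter(x, y, s=10)
    axes[2].plot(x, slope * x + intercept, color="red")
    axes[2].set_title("Regression")

    plt.tight_layout()
    plt.show()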

Throughout the history of human society, the demands and willingness of human beings have always been the source powers that promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, the analysis of big data in SNS, and other applications closely related to human activities based on big data

will attract increasing attention and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 116 Yuki N (2011) Following digital breadcrumbs to big data gold

httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolution that will transform how we live work and think Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase the definitive guide OrsquoReilly Media Inc91 Judd D (2008) hypertable-09 04-alpha92 Chodorow K (2013) MongoDB the definitive guide OrsquoReilly

Media Inc93 Crockford D (2006) The applicationjson media type for

javascript object notation (json)94 Murty J (2009) Programming amazon web services S3 EC2

SQS FPS and SimpleDB OrsquoReilly Media Inc95 Anderson JC Lehnardt J Slater N (2010) CouchDB the defini-

tive guide OrsquoReilly Media Inc96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y

(2010) A comparison of join algorithms for log processing inmapreduce In Proceedings of the 2010 ACM SIGMOD inter-national conference on management of data ACM pp 975ndash986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strategies Tartu University Faculty of Mathematics and Computer Sciences


Page 13: Big Data: A Survey Min Chen

mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected by its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TORs), and such top-of-rack switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one more layer on top of the two-layer structure, constituted by 10 Gbps or 100 Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55-58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, thanks to the huge success of optical technologies, the optical interconnection of the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connections for the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection schemes have been proposed for data center networks [59]. Some schemes add optical paths to upgrade the existing networks, while other schemes completely replace the current switches [59-64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks. A small topology-sizing sketch follows.
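The capacity of such multi-rooted trees can be estimated with a few standard formulas. The sketch below sizes a classic k-ary fat-tree built from identical k-port switches and computes the oversubscription of a two-layer rack as described above; the fat-tree formulas are the usual textbook ones rather than something specific to this survey, and the rack numbers are assumptions.

    def fat_tree_capacity(k: int) -> dict:
        """Sizes for a fat-tree built from identical k-port switches (k even)."""
        assert k % 2 == 0
        return {
            "pods": k,
            "edge_switches": k * (k // 2),         # k/2 edge (top-of-rack) switches per pod
            "aggregation_switches": k * (k // 2),  # k/2 aggregation switches per pod
            "core_switches": (k // 2) ** 2,
            "hosts": (k ** 3) // 4,                # k/2 hosts per edge switch
        }

    print(fat_tree_capacity(8))   # 8-port switches -> 128 hosts, 16 core switches

    # Oversubscription at a top-of-rack switch in the simpler two-layer design:
    # forty 1 Gbps host downlinks share a single 10 Gbps uplink (assumed numbers).
    hosts_per_rack, host_link_gbps, uplink_gbps = 40, 1, 10
    print("oversubscription = %.1f : 1" % (hosts_per_rack * host_link_gbps / uplink_gbps))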

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data

under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relevant data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of the data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: the data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load); a minimal ETL sketch is given after this list. Extraction involves connecting source systems, and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex of the three procedures, and includes operations such as transformation, copying, cleaning, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data; on the contrary, it includes information or metadata related to the actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or of search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process of identifying inaccurate, incomplete, or unreasonable data, and then modifying or deleting such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance for keeping data consistent, and it is widely applied in many fields, such as banking, insurance, retail, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, and it may have serious data quality problems. Classic data quality problems mainly come from software defects, customization errors, or system misconfiguration. The authors in [69] discussed data cleaning

Mobile Netw Appl (2014) 19171ndash209 183

in e-commerce by means of crawlers and the regular re-copying of customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, including a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors in input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space leading to data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method that exploits the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion is a special data compression

technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier identical to one already listed in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block; a small sketch of this scheme is given after this list. Repeated data deletion can greatly reduce the storage requirement, which is particularly important for a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76-78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
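The ETL process mentioned in the integration item above can be pictured with a very small end-to-end sketch. Everything concrete in it is assumed for illustration: the CSV column names, the screening rule, and the SQLite "warehouse" stand in for real source systems and target storage.

    import csv
    import sqlite3

    def extract(csv_path: str) -> list[dict]:
        with open(csv_path, newline="") as f:       # connect to and read a source system
            return list(csv.DictReader(f))

    def transform(rows: list[dict]) -> list[tuple]:
        out = []
        for r in rows:                              # standardize formats, screen bad records
            if not r.get("user_id"):
                continue
            out.append((int(r["user_id"]), r["country"].strip().upper(), float(r["amount"])))
        return out

    def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
        con = sqlite3.connect(db_path)              # import into the target storage
        con.execute("CREATE TABLE IF NOT EXISTS sales(user_id INT, country TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        con.commit()
        con.close()

    # load(transform(extract("orders.csv")))        # "orders.csv" is a hypothetical input file

The repeated-data-deletion scheme described in the redundancy item can be sketched just as briefly: each block receives a hash identifier, and a block whose identifier already appears in the identification list is stored only once. The block size and the sample data are arbitrary choices.

    import hashlib

    def dedup_store(data: bytes, block_size: int = 4):
        store = {}       # identifier -> stored block (the "identification list" plus payload)
        layout = []      # per-block identifiers describing the original stream
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in store:           # only previously unseen blocks consume space
                store[ident] = block
            layout.append(ident)
        return store, layout

    store, layout = dedup_store(b"ABCDABCDABCDEFGH")
    print(len(layout), "blocks referenced,", len(store), "blocks actually stored")  # 4 vs 2
    # the original stream is recoverable as b"".join(store[i] for i in layout)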

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On the one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to remain competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually an auxiliary storage device of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

In terms of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disc arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks for multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected in terms of satisfying customers' read and write requests. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theorem [80, 81] in 2000, which indicated that a distributed system cannot simultaneously meet the requirements on consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theorem in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. AP systems differ from CP systems in that they also ensure availability, but they only ensure eventual consistency rather than the strong consistency of the previous two types of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data error is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges in categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy copy, simple APIs, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and the modern key-value databases are characterized by high expandability and shorter query response time than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform, which can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface constituted by simple read and write operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on Consistent Hashing [86], whose main advantage is that node passing only affects directly adjacent nodes and does not affect other nodes, to divide the load among multiple main storage machines (a minimal consistent-hashing sketch is given at the end of this group of key-value stores). Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations, reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged only a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys onto nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid the need for backup.
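
To make the partitioning idea behind Dynamo-style key-value stores concrete, the following is a minimal consistent-hashing sketch. The hash function (MD5), the number of virtual nodes, and the server names are illustrative assumptions rather than the internals of any of the systems above; the point is only that adding or removing a node remaps just the keys between that node and its ring neighbor.

    import hashlib
    from bisect import bisect_left

    class ConsistentHashRing:
        """Toy consistent-hash ring mapping keys to storage nodes."""

        def __init__(self, nodes=(), vnodes=64):
            self.vnodes = vnodes     # virtual nodes smooth the load per server
            self._ring = {}          # ring position -> physical node
            self._positions = []     # sorted ring positions
            for node in nodes:
                self.add_node(node)

        def _hash(self, value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def add_node(self, node):
            # each physical node owns several positions on the ring
            for i in range(self.vnodes):
                self._ring[self._hash(f"{node}#{i}")] = node
            self._positions = sorted(self._ring)

        def get_node(self, key):
            # walk clockwise to the first virtual node at or after the key
            pos = self._hash(key)
            idx = bisect_left(self._positions, pos) % len(self._positions)
            return self._ring[self._positions[idx]]

    ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
    print(ring.get_node("user:42"))   # the node responsible for this key
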

– Column-oriented databases: column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented over multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map (a toy sketch of this data model is given at the end of the column-oriented discussion). The indexes of the map are a row key, a column key, and a timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a 64KB character string. Rows are stored in lexicographical order and are continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. Columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units of access control. The timestamps are 64-bit integers used to distinguish different editions of a cell value. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in descending order of timestamps, so the latest edition will always be read first.

The BigTable API features the creation and deletion of Tablets and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable includes three main components: the Master server, Tablet servers, and the client library. BigTable only allows one active Master server, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balance. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, as well as deleted or disabled files, and use them in specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of the loaded Tablets. When Tablets grow too big, they will be segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine status. The SSTable file format is used to store BigTable data internally, and it provides a mapping between persistent, sorted, and immutable keys and values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following server tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system for managing the huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may constitute clusters, called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional at large scale. Partition and distribution are operated transparently and have space for client hashing or fixed keys.

HyperTable was developed similarly to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. The data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), and allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.
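
To give a feel for the data model shared by BigTable and its derivatives, here is a toy in-memory version of the sparse, multi-dimensional sorted map described above. The row keys, column family names, and timestamps are invented for illustration; real systems persist the map in SSTables over GFS/HDFS rather than in Python dictionaries.

    import bisect

    class ToyBigTable:
        """Toy BigTable data model: a sparse map
        (row_key, 'family:qualifier', timestamp) -> bytes, with rows kept in
        lexicographical order and reads returning the newest edition first."""

        def __init__(self):
            self._rows = {}        # row_key -> {column -> [(ts, value), ...]}
            self._row_index = []   # sorted row keys, mimicking Tablet ordering

        def put(self, row, column, value, ts):
            if row not in self._rows:
                bisect.insort(self._row_index, row)
                self._rows[row] = {}
            cells = self._rows[row].setdefault(column, [])
            cells.append((ts, value))
            cells.sort(reverse=True)           # newest edition first

        def get(self, row, column):
            return self._rows[row][column][0]  # (latest_ts, value)

        def scan(self, start_row, end_row):
            # contiguous row ranges map to cheap scans over few Tablets
            lo = bisect.bisect_left(self._row_index, start_row)
            hi = bisect.bisect_right(self._row_index, end_row)
            return [(r, self._rows[r]) for r in self._row_index[lo:hi]]

    t = ToyBigTable()
    t.put("com.example/index", "anchor:home", b"Home page", ts=2)
    t.put("com.example/index", "anchor:home", b"Homepage (new)", ts=5)
    print(t.get("com.example/index", "anchor:home"))   # (5, b'Homepage (new)')
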

– Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict modes, there is no need to conduct mode migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB (a toy illustration of the JSON-style query model they share follows these three systems).

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries over all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB is executed with log files on the main nodes that record all the high-level operations conducted in the database. During copying, the slave nodes query all the writing operations since the last synchronization to the master and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partition and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, the identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the copying mechanism. CouchDB supports MVCC with historical Hash records.
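
The JSON-style query model that these document stores share can be imitated in a few lines of plain Python. The collection, field names, and the single "$gt" operator below are hypothetical and only mimic the flavor of such query documents; they are not the API of MongoDB, SimpleDB, or CouchDB.

    def matches(doc, query):
        """Return True if a JSON-like document satisfies a simple query document.
        Supports plain equality and an illustrative "$gt" operator."""
        for field, cond in query.items():
            value = doc.get(field)
            if isinstance(cond, dict):           # operator form, e.g. {"$gt": 30}
                if "$gt" in cond and not (value is not None and value > cond["$gt"]):
                    return False
            elif value != cond:                  # plain equality
                return False
        return True

    # A tiny in-memory "collection" of schema-free documents.
    people = [
        {"_id": 1, "name": "Alice", "age": 34, "tags": ["admin"]},
        {"_id": 2, "name": "Bob",   "age": 27},
    ]
    print([d["name"] for d in people if matches(d, {"age": {"$gt": 30}})])  # ['Alice']
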

Big data is generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application (a minimal word-count sketch is given at the end of this item). The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming the basic functions, which are typically hard to maintain and reuse. In order to improve the programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
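
A minimal word-count sketch of the two user-programmed functions is given below. The small driver that groups intermediate pairs by key stands in for the framework's scheduling, shuffling, and fault-tolerance machinery; the input lines are made up for illustration.

    from collections import defaultdict

    # Map: one input record (here, a line of text) -> intermediate key-value pairs.
    def map_fn(line):
        for word in line.split():
            yield word.lower(), 1

    # Reduce: one key and all its intermediate values -> a smaller output set.
    def reduce_fn(word, counts):
        yield word, sum(counts)

    def run_job(records):
        """Stand-in for the framework: run map, shuffle by key, run reduce."""
        groups = defaultdict(list)
        for record in records:
            for key, value in map_fn(record):
                groups[key].append(value)          # shuffle/sort phase
        results = {}
        for key, values in groups.items():
            for out_key, out_value in reduce_fn(key, values):
                results[out_key] = out_value
        return results

    print(run_job(["big data is big", "data is valuable"]))
    # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
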

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed on clusters or workstations over the network. The job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmission, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after job completion in the batch processing system, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order. The core abstraction itself is compact, as sketched below.
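
Stripped of the four distribution phases, the All-Pairs abstraction is simply the application of F to every element pair of the two sets, collected into a matrix. The sets and the Hamming-style similarity function below are made-up stand-ins (for, say, biometric templates), not data or code from [103].

    def all_pairs(set_a, set_b, f):
        """Compute the output matrix M with M[i][j] = f(set_a[i], set_b[j])."""
        return [[f(a, b) for b in set_b] for a in set_a]

    # Hypothetical example: compare short binary "templates" by Hamming similarity.
    def similarity(x, y):
        return sum(1 for cx, cy in zip(x, y) if cx == cy) / len(x)

    A = ["10110", "00111"]
    B = ["10100", "10111", "00000"]
    print(all_pairs(A, B, similarity))
    # [[0.8, 0.8, 0.4], [0.4, 0.8, 0.4]]
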

– Pregel: the Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable, user-defined value, and every directed edge related to a source vertex is constituted by the user-defined value and the identifier of a target vertex. After the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt. When all vertexes are in an inactive status without any messages to transmit, the entire program execution is completed. The Pregel program output is the set of values output by all the vertexes (a toy superstep loop is sketched below). Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
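
The superstep model can be imitated in a few lines. The sketch below propagates the maximum vertex value through a hypothetical directed graph, a standard textbook illustration of vertex-centric computation rather than an example from [104]; messages, activity, and halting mirror the behavior described above.

    def pregel_max(graph, value, max_supersteps=50):
        """Vertex-centric max propagation: each superstep, an active vertex takes
        the max of its incoming messages; if its value grew, it forwards the new
        value along its out-edges, otherwise it votes to halt. Execution ends
        when no messages are in flight."""
        # Superstep 0: every vertex announces its own value to its neighbors.
        inbox = {v: [] for v in graph}
        for v in graph:
            for nbr in graph[v]:
                inbox[nbr].append(value[v])

        for _ in range(max_supersteps):
            if not any(inbox.values()):          # all vertices halted, no messages
                break
            next_inbox = {v: [] for v in graph}
            for v, msgs in inbox.items():
                if not msgs:
                    continue                     # this vertex stays halted
                best = max(msgs)
                if best > value[v]:              # value grew: stay active, notify
                    value[v] = best
                    for nbr in graph[v]:
                        next_inbox[nbr].append(best)
            inbox = next_inbox
        return value

    # Hypothetical 4-vertex directed graph; every vertex ends with the global max 9.
    graph = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": ["a"]}
    print(pregel_max(graph, {"a": 3, "b": 6, "c": 9, "d": 1}))
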

Inspired by the above programming models, other researchers have also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity, while different categories have high heterogeneity. Cluster analysis is an unsupervised method that requires no training data (a minimal k-means sketch is given after this list of methods).

– Factor analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, and then using the few factors to reveal most of the information of the original data.


– Correlation analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, also called a definitive dependence relationship; (ii) correlation, some undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B testing: also called bucket testing, a technology for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
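
As an illustration of the cluster analysis entry above, the following is a minimal k-means sketch in pure Python. The two-dimensional points, the choice k = 2, the deterministic initialization, and the fixed iteration count are illustrative simplifications rather than recommendations.

    def kmeans(points, k, iterations=20):
        """Minimal k-means: alternately assign each point to its nearest centroid
        and recompute every centroid as the mean of the points assigned to it."""
        centroids = [p for p in points[:k]]          # simple deterministic start
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
                clusters[dists.index(min(dists))].append(p)
            for i, cluster in enumerate(clusters):
                if cluster:                          # leave empty clusters untouched
                    centroids[i] = tuple(sum(vals) / len(cluster) for vals in zip(*cluster))
        return centroids, clusters

    points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.0), (8.3, 7.7)]
    centroids, clusters = kmeans(points, k=2)
    print(centroids)   # two centroids, one near each dense group of points
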

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: a Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store the Hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages in misrecognition and deletion (a minimal sketch is given at the end of this subsection).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound Hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of a Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
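
Returning to the Bloom Filter entry above, a minimal sketch follows. The bit-array size, the number of hash functions, and the trick of salting a single SHA-256 hash are illustrative choices; the sketch shows why lookups may return false positives but never false negatives, and why elements cannot be deleted.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: k salted hashes set k bits per item; membership tests
        may yield false positives, never false negatives, and items cannot be removed."""

        def __init__(self, m=1024, k=4):
            self.m, self.k = m, k
            self.bits = bytearray(m)          # 0/1 flags; a real filter packs bits

        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def __contains__(self, item):
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("user:42")
    print("user:42" in bf, "user:43" in bf)   # True, almost certainly False
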

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

                             MPI                                  MapReduce                            Dryad
Deployment                   Computing nodes and data storage     Computing and data storage           Computing and data storage
                             arranged separately (data moved      arranged at the same node            arranged at the same node
                             to the computing nodes)              (computing close to data)            (computing close to data)
Resource management /        –                                    Workqueue (Google), HOD (Yahoo)      Not clear
scheduling
Low-level programming        MPI API                              MapReduce API                        Dryad API
High-level programming       –                                    Pig, Hive, Jaql, ...                 Scope, DryadLINQ
Data storage                 The local file system, NFS, ...      GFS (Google), HDFS (Hadoop),         NTFS, Cosmos DFS
                                                                  KFS, Amazon S3, ...
Task partitioning            User manually partitions the tasks   Automation                           Automation
Communication                Messaging, remote memory access      Files (local FS, DFS)                Files, TCP pipes,
                                                                                                       shared-memory FIFOs
Fault tolerance              Checkpoint                           Task re-execution                    Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in E-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The currently mainstream BI products are provided with data analysis plans supporting levels above TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software tools, according to a survey of 798 professionals, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?", conducted by KDnuggets in 2012 [112].

– R (30.7 %). R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked top 1 in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %). Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %). RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was more frequently used than R (and ranked top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %). KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to expand KNIME. Developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %). Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of commercial applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require a considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of network applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of scientific applications: scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, such as astrophysics, oceanology, genomics, and environmental research. The U.S. National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages (a toy PageRank iteration is sketched below). Topic-oriented crawlers are another successful case of utilizing such models [127].
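
As an illustration of such link-structure models, the sketch below computes PageRank-style scores by power iteration over a toy link graph. The damping factor 0.85 and the four-page graph are conventional illustrative choices, not values taken from [125].

    def pagerank(links, damping=0.85, iterations=50):
        """Power-iteration PageRank over a dict page -> list of outgoing links.
        A page's score is split evenly among the pages it links to; dangling
        pages (no out-links) spread their score uniformly over all pages."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for p, outgoing in links.items():
                targets = outgoing if outgoing else pages      # dangling page
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            rank = new_rank
        return rank

    # Hypothetical four-page web; page "a" receives the most links and ranks highest.
    links = {"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["a", "c"]}
    print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))
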

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; useful knowledge is extracted and the semantics are understood through multimedia analysis. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. Based on the result of structural analysis, the second procedure is feature extraction, which mainly involves further mining the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video content and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.
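The query-and-retrieval step described above ultimately reduces to comparing feature vectors. The sketch below ranks indexed videos against a query by cosine similarity; the low-dimensional feature vectors are hypothetical stand-ins for whatever features the extraction procedure actually produces.

# Rank stored videos against a query by cosine similarity of feature vectors (illustrative).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical features extracted from key frames of indexed videos.
video_index = {
    "news_clip": [0.9, 0.1, 0.3],
    "sports_clip": [0.2, 0.8, 0.5],
    "lecture_clip": [0.7, 0.2, 0.1],
}
query_features = [0.8, 0.15, 0.2]

ranked = sorted(video_index.items(), key=lambda kv: cosine(query_features, kv[1]), reverse=True)
for name, _ in ranked:
    print(name, round(cosine(query_features, video_index[name]), 3))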

Multimedia recommendation aims to recommend specific multimedia content according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of the content they are interested in, and recommend to users other content with similar features. These methods rely largely on content similarity measurement, but most of them suffer from limited analysis and over-specialization. Collaborative-filtering-based methods identify groups of users with similar interests and recommend content to group members according to their behavior [132]. Recently, hybrid methods that integrate the advantages of these two types of methods have been introduced to improve recommendation quality [133].
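A minimal illustration of the collaborative-filtering idea follows: users are compared by the similarity of their past ratings, and unseen items are scored with similarity-weighted ratings from other users. The rating matrix and weighting scheme are toy assumptions, not a description of any deployed recommender.

# User-based collaborative filtering on a tiny rating matrix (illustrative sketch).
import math

ratings = {  # hypothetical user -> {item: rating}
    "alice": {"video1": 5, "video2": 3, "video3": 4},
    "bob":   {"video1": 4, "video2": 2, "video4": 5},
    "carol": {"video2": 5, "video3": 2, "video4": 1},
}

def similarity(u, v):
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:            # only score unseen items
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("alice"))   # e.g. [('video4', ...)]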

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and example videos [134]. In [135], the authors proposed a new algorithm for multimedia event detection that uses only a few positive training examples. Research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities, while the content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has long been devoted to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in an SNS [140]. Linear algebra methods compute the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
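As a small example of the feature-based flavor of link prediction, the sketch below scores unlinked vertex pairs by their number of common neighbors, one of the simplest structural features used to predict future links; the friendship graph is hypothetical.

# Score unlinked vertex pairs by the number of common neighbors (a basic link-prediction feature).
from itertools import combinations

# Hypothetical undirected friendship graph: vertex -> set of neighbors.
graph = {
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3", "u4"},
    "u3": {"u1", "u2", "u5"},
    "u4": {"u2"},
    "u5": {"u3"},
}

def common_neighbor_scores(g):
    scores = []
    for a, b in combinations(sorted(g), 2):
        if b in g[a]:
            continue                      # already linked, nothing to predict
        scores.append(((a, b), len(g[a] & g[b])))
    return sorted(scores, key=lambda kv: -kv[1])

for pair, score in common_neighbor_scores(graph):
    print(pair, score)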

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on target functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
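To illustrate topology-based community detection in general (not the specific method of [143]), the following sketch runs a few sweeps of asynchronous label propagation on a toy graph: each vertex repeatedly adopts the label most common among its neighbors, and vertices sharing a final label form a detected community.

# Asynchronous label propagation for community detection on a small graph (illustrative).
import random
from collections import Counter

graph = {  # hypothetical undirected graph with two loosely connected groups
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}

random.seed(42)
labels = {v: v for v in graph}            # every vertex starts in its own community
for _ in range(10):                       # a few sweeps suffice for tiny graphs
    order = list(graph)
    random.shuffle(order)
    for v in order:
        counts = Counter(labels[n] for n in graph[v])
        best = max(counts.values())
        labels[v] = random.choice([l for l, c in counts.items() if c == best])

print(labels)   # vertices sharing a label form one detected community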

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media includes text, multimedia, positioning data, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, just as Twitter contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile data analysis, but they also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since research on mobile analysis has only just started, we introduce only some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical location and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. By contrast, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) gather together on a network, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for the real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones that analyzes people's paces when they walk and uses the pace information to unlock the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from, and is mainly used in, enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
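The drop-out warning idea can be illustrated with a toy scoring sketch: each customer receives a churn probability from a logistic score, and the top 20 % are selected for retention offers. The features, weights, and threshold are entirely hypothetical; CMB's actual model is not public.

# Illustrative drop-out (churn) warning score: rank customers by a toy logistic score
# and pick the top 20 % for retention offers. Features and weights are hypothetical.
import math

customers = {            # hypothetical features: (months_inactive, balance_drop_ratio)
    "c1": (1, 0.05), "c2": (6, 0.40), "c3": (3, 0.25),
    "c4": (9, 0.70), "c5": (2, 0.10),
}
weights, bias = (0.35, 2.0), -2.5          # assumed coefficients for illustration

def churn_probability(features):
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))      # logistic function

ranked = sorted(customers, key=lambda c: churn_probability(customers[c]), reverse=True)
top_20_percent = ranked[: max(1, len(ranked) // 5)]
print(top_20_percent)                       # customers to target with retention products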

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the ages, genders, addresses, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises have perhaps the most profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and to optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

The smart city is a hot research area based on the application of IoT data. For example, the smart city project, a cooperation between Miami-Dade County in Florida and IBM, closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The smart city application brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills in a single year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

An online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., and represents various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing the social data of more than one million American Facebook users. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old; finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about the rice price on Twitter, as shown in Fig. 6.
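The first analysis aspect above, detecting sharp growth or drops in topic volume, can be sketched with a simple rolling z-score rule; the daily counts, window size, and threshold below are made up for illustration and do not describe Global Pulse's actual method.

# Flag days where the tweet volume on a topic deviates sharply from the recent average.
import statistics

daily_counts = [120, 130, 125, 128, 131, 127, 410, 126, 50, 129]   # hypothetical counts
window = 5

for day in range(window, len(daily_counts)):
    history = daily_counts[day - window:day]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0      # avoid division by zero
    z = (daily_counts[day] - mean) / stdev
    if abs(z) > 3:
        print(f"day {day}: count={daily_counts[day]} z={z:.1f} -> abnormal growth or drop")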

Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and to master the laws of social and economic activities from the following three aspects:

– Early Warning: rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. With such plans, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies from Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through a software development kit (SDK) and open interfaces.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. Crowd sensing can help us complete large-scale and complex social sensing tasks, and participants who complete such sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as its foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data; for example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not, or do not want to, accomplish. With no need for intentionally deploying sensing modules or employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operational framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing platforms, e.g., Amazon Mechanical Turk and Crowdflower.
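A toy version of the spatial-crowdsourcing workflow just described might assign each location-based task to the nearest willing worker, as in the sketch below; the worker positions, task locations, and the one-task-per-worker rule are illustrative assumptions rather than any deployed platform's policy.

# Assign each spatial sensing task to the nearest willing worker (toy sketch).
import math

workers = {"w1": (1.0, 2.0), "w2": (5.0, 1.0), "w3": (3.0, 4.0)}   # worker -> (x, y)
tasks = {"photo_at_market": (4.5, 1.5), "noise_level_park": (2.0, 3.5)}

def assign(tasks, workers):
    assignment = {}
    free = dict(workers)
    for task, loc in tasks.items():
        if not free:
            break
        nearest = min(free, key=lambda w: math.dist(free[w], loc))
        assignment[task] = nearest
        free.pop(nearest)                 # one task per worker in this simple sketch
    return assignment

print(assign(tasks, workers))   # e.g. {'photo_at_market': 'w2', 'noise_level_park': 'w3'}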

6.3.6 Smart grid

The smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for the optimized generation, supply, and consumption of electric energy. Smart-grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). The smart grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the smart grid, the regions that have excessively high electrical loads or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the given moment. It can even compare the power consumption of a block with the average income per capita and the building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city; preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map (a toy illustration of this kind of per-block load analysis appears after this list).

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced, and, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to the peak and low periods of power consumption. TXU Energy utilized such price levers to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can thus complement the traditional hydropower and thermal power generation.
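The per-block load analysis referred to in the grid-planning item can be illustrated as follows: from 15-minute smart-meter readings, compute each block's peak and average load and flag blocks whose peak exceeds a planning threshold. All readings and the threshold are invented for illustration and do not come from any utility's data.

# Toy illustration: flag blocks whose peak 15-minute load exceeds a planning threshold.
readings = {   # block -> list of kWh readings for consecutive 15-minute intervals
    "block_A": [1.2, 1.4, 3.9, 4.2, 1.1],
    "block_B": [0.8, 0.9, 1.0, 1.1, 0.9],
    "block_C": [2.5, 2.7, 5.1, 4.8, 2.4],
}
threshold_kwh = 4.0

for block, series in readings.items():
    peak = max(series)
    average = sum(series) / len(series)
    status = "candidate for upgrade" if peak > threshold_kwh else "normal"
    print(f"{block}: peak={peak} avg={average:.2f} -> {status}")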

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. We then focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of the display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there are many important problems that remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined, and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many solutions for big data applications claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after a system has been implemented and deployed, which makes it difficult to horizontally compare the advantages and disadvantages of alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis for data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data has triggered advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is therefore a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; and (iii) data exhaust, i.e., incorrect data collected during acquisition: in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include the searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and its evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data applications: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows rapidly, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed to be the big data company with the most SNS data at present. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanisms: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of the log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security itself, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impacts, but will also influence everyone's ways of living and thinking; this is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data; the theoretical basis of Hadoop emerged as early as 2006. Many researchers are seeking better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by the globally-distributed database Spanner of Google and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. Over the history of program design, it can be observed that the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from being algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, and they will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we are willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, the analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 116 Yuki N (2011) Following digital breadcrumbs to big data gold

httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MS Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A O'Shea G Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82 McKusick MK Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Finding a needle in haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85 DeCandia G Hastorun D Jampani M Kakulapati G Lakshman A Pilchin A Sivasubramanian S Vosshall P Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (last accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data is of low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, leading to data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well-known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed as redundant and will be replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features of the problem, the performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
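To make the identifier-based deduplication idea concrete, the following is a minimal sketch, not the mechanism of any particular storage product, that hashes fixed-size blocks and stores each unique block only once; the block size and the use of SHA-256 are illustrative assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size


class DedupStore:
    """Toy block store that keeps one copy per unique block identifier."""

    def __init__(self):
        self.blocks = {}        # identifier -> block bytes (stored once)
        self.identifiers = []   # identification list describing the written stream

    def write(self, data):
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in self.blocks:      # new block: store it
                self.blocks[ident] = block
            self.identifiers.append(ident)    # redundant block: keep only the reference

    def read(self):
        return b"".join(self.blocks[i] for i in self.identifiers)


store = DedupStore()
payload = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # last block repeats an earlier one
store.write(payload)
assert store.read() == payload
print(len(store.blocks), "unique blocks stored")      # 2 instead of 4
```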

4 Big data storage

The explosive growth of data places more strict requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage services with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly more important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems emerge to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually an auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disc arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks for multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected, so as to satisfy customers' requests in terms of reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three types of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as the traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP could not ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that AP systems also ensure availability. AP systems only ensure eventual consistency rather than the strong consistency of the previous two types of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors are tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have been relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

The database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy copy, simple API, eventual consistency, and support of large-volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may input queried values according to the keys. Such databases feature a simple structure, and the modern key-value databases are characterized with high expandability and shorter query response time than those of relational databases. Over the past few years, many key-value databases have appeared, as motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through the data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on Consistent Hashing [86], whose main advantage is that node passing only affects directly adjacent nodes and does not affect other nodes, to divide the load among multiple main storage machines (a toy sketch of consistent hashing is given after this group of key-value systems). Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations, reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the updating and any other operations, the updating operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

The key-value databases emerged a few years ago. Deeply influenced by Amazon Dynamo DB, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys onto nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid the need of backup.
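Dynamo's partitioning relies on consistent hashing [86]. The following is a minimal, generic sketch of a hash ring, not Amazon's implementation; the virtual-node count and the use of MD5 are illustrative assumptions. Keys are assigned to the first node encountered clockwise on the ring, so adding or removing a server only remaps the keys adjacent to its positions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Toy consistent-hashing ring: a key maps to the first node clockwise."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes          # virtual nodes per server (assumed value)
        self.ring = []                # sorted list of (hash position, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, chr(0x10FFFF)))  # first position >= h
        return self.ring[idx % len(self.ring)][1]            # wrap around the ring


ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get_node("user:42"))
ring.add_node("server-d")   # only keys adjacent to the new node's positions move
```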

– Column-oriented databases: the column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented in multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional, sequenced mapping with sparse, distributed, and persistent storage. The indexes of the mapping are row key, column key, and timestamp, and every value in the mapping is an unanalyzed byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short row of data can be highly efficient, since it only involves communication with a small portion of machines. Columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. Timestamps are 64-bit integers to distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in the descending order of timestamps, so the latest edition will always be read. (A toy sketch of this data model is given after the discussion of column-oriented systems.)

The BigTable API features the creation and deletion of Tablets and column families, as well as modification of metadata of clusters, tables, and column families. Client applications may insert or delete values of BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable involves three main components: Master server, Tablet server, and client library. BigTable allows only one Master server to be distributed, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage in GFS, i.e., deleted or disabled files used by specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of a loaded Tablet. When Tablets are too big, they will be segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a mapping between persistent, sequenced, and unchangeable keys and values as arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions include row, column, column family, and super column. A row is distinguished by a string key with arbitrary length. No matter the amount of columns to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable cloned version programmed with Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic operations, equipped with row-level locking and transaction processing, which is optional for large scale. Partition and distribution are transparently operated and have space for client hash or fixed key.

HyperTable was developed similarly to BigTable to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. The data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.
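The BigTable data model described above is essentially a sparse, sorted, multi-dimensional map from (row key, column, timestamp) to an uninterpreted byte string. The toy sketch below is not tied to any specific BigTable or HBase API; it keeps the newest editions first so that a read returns the latest value by default, and the number of retained versions is an illustrative assumption.

```python
from collections import defaultdict
import time


class ToyBigtable:
    """Sparse map: (row, 'family:qualifier') -> list of (timestamp, value), newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions          # assumed retention policy per cell
        self.cells = defaultdict(list)            # (row, column) -> [(ts, value), ...]

    def put(self, row, column, value, ts=None):
        versions = self.cells[(row, column)]
        versions.append((ts if ts is not None else time.time_ns(), value))
        versions.sort(reverse=True)               # descending timestamps: latest first
        del versions[self.max_versions:]          # keep only the configured editions

    def get(self, row, column):
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None   # newest edition wins


t = ToyBigtable()
t.put("com.example.www", "anchor:cnnsi.com", b"CNN")
t.put("com.example.www", "anchor:cnnsi.com", b"CNN Sports Illustrated")
print(t.get("com.example.www", "anchor:cnnsi.com"))   # latest value is returned
```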

– Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict modes, there is no need to conduct mode migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays (see the example after this list of document stores). To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files in the main nodes that support all the high-level operations conducted in the database. During copying, the slave nodes query all the writing operations since the last synchronization from the master and execute the operations in the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes by automatically balancing load and failover.


– SimpleDB: SimpleDB is a distributed database and is a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partition and thus could not be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency, but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein could not be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through the RESTful HTTP API. If a document needs to be modified, the client must download the entire document to modify it and then send it back to the database. After a document is rewritten once, the identifier will be updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may execute along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the copying mechanism. CouchDB supports MVCC with historical Hash records.
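As a concrete illustration of the document model discussed above, the snippet below uses the standard pymongo driver to store and query JSON-like documents; the connection URI, database name, and field names are illustrative assumptions rather than anything prescribed by the systems surveyed here.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed local MongoDB server
db = client["survey_demo"]                          # hypothetical database name

# Documents are schema-free; embedded objects and arrays are directly queryable.
db.articles.insert_one({
    "title": "Big Data: A Survey",
    "year": 2014,
    "tags": ["big data", "cloud computing", "hadoop"],
    "venue": {"name": "Mobile Netw Appl", "volume": 19},
})

# A secondary index on a queryable field speeds up lookups.
db.articles.create_index([("year", ASCENDING)])

# The query itself is expressed as a JSON/BSON-like filter, including an embedded field.
for doc in db.articles.find({"year": {"$gte": 2010}, "venue.name": "Mobile Netw Appl"}):
    print(doc["title"])
```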

Big data are generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application (a word-count sketch is given below). The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve the programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [83] of Microsoft.
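To show what the two user-programmed functions look like, here is a minimal word-count sketch that simulates the Map, shuffle, and Reduce phases in plain Python; a real MapReduce framework (e.g., Hadoop) would distribute these steps across a cluster, which this illustration does not attempt.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: compress the list of values for one key into a smaller result."""
    yield word, sum(counts)


def run_job(lines):
    # Shuffle: group all intermediate values by key before reducing.
    groups = defaultdict(list)
    for key, line in enumerate(lines):
        for word, one in map_fn(key, line):
            groups[word].append(one)
    return dict(pair for word, counts in groups.items() for pair in reduce_fn(word, counts))


print(run_job(["big data needs big storage", "big data analysis"]))
# {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'analysis': 1}
```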

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or workstations through the network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making, which does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance will be built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which makes the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine will build a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine will collect results and combine them in a proper structure, which is generally a single file list in which all results are put in order.
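A minimal serial sketch of the All-Pairs abstraction follows; the real system distributes the work across the four phases described above, while this only shows the interface. The comparison function is a made-up example.

```python
def all_pairs(set_a, set_b, f):
    """Return the matrix M with M[i][j] = f(set_a[i], set_b[j]) (the cross join)."""
    return [[f(a, b) for b in set_b] for a in set_a]


# Hypothetical comparison function, e.g., a crude similarity score between two items.
def overlap(a, b):
    return len(set(a) & set(b))


M = all_pairs(["abc", "abd"], ["abe", "xyz"], overlap)
print(M)  # [[2, 0], [2, 0]]
```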

– Pregel: the Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable and user-defined value, and every directed edge related to a source vertex is constituted by the user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is finished. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify the status of itself and its output edges, receive messages sent from the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may be deactivated by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
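The vertex-centric superstep model can be imitated on a single machine. The sketch below, a simplification rather than Google's implementation, propagates the maximum vertex value through a directed graph and halts when no vertex has messages to process, mirroring the termination rule described above. The graph and values are made-up examples.

```python
def pregel_max(graph, values):
    """Propagate the maximum value; graph maps each vertex to its out-neighbours."""
    superstep = 0
    inbox = {v: [] for v in graph}           # messages received in the previous superstep
    active = set(graph)                       # all vertices start active
    while active:
        outbox = {v: [] for v in graph}
        for v in active:
            new_value = max([values[v]] + inbox[v])
            if superstep == 0 or new_value > values[v]:
                values[v] = new_value
                for target in graph[v]:       # send the updated value along out-edges
                    outbox[target].append(new_value)
            # otherwise the vertex votes to halt and sends nothing
        inbox = outbox                        # global synchronization point
        active = {v for v, msgs in inbox.items() if msgs}   # reactivated by messages
        superstep += 1
    return values


graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1}))
# {'a': 6, 'b': 6, 'c': 6, 'd': 1}  (d has no incoming edges, so it never updates)
```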

Inspired by the above programming models, other researchers have also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster analysis: a statistical method for grouping objects, specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category will have high homogeneity while different categories will have high heterogeneity. Cluster analysis is an unsupervised study method without training data (a k-means sketch is given after this list).

– Factor analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor, and the few factors are then used to reveal most of the information of the original data.



– Correlation analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, i.e., some undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation surrounding their mean values.

– Regression analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research.
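As a toy illustration of the clustering described above (and of k-means, one of the ICDM top-ten algorithms), the snippet below uses the scikit-learn library, which is assumed to be available; the data is synthetic and the parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points forming two loose groups (illustrative data only).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(5.0, 0.5, (50, 2))])

# Unsupervised grouping into k = 2 clusters, with no training labels involved.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(np.bincount(labels))   # roughly 50 points per cluster
```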

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are shown as follows.

– Bloom Filter: a Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store Hash values of data, rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition and deletion (a minimal sketch is given after this list).

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound Hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost for storing index files, which should be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the Trie is to utilize common prefixes of character strings to reduce comparison on character strings to the greatest extent, so as to improve query efficiency.

– Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem and assign the pieces to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
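Following up on the Bloom Filter item above, here is a minimal sketch; the array size, the number of hash functions, and the use of salted SHA-256 digests are illustrative assumptions, not a prescription from the surveyed literature.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array (sizes are illustrative)."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False positives are possible (misrecognition); false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter()
bf.add("big data")
print(bf.might_contain("big data"), bf.might_contain("small data"))  # True, (almost surely) False
```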

Although the parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce and Dryad

– Deployment: MPI arranges computing nodes and data storage separately (data should be moved to the computing nodes); MapReduce and Dryad arrange computing and data storage at the same node (computing should be close to data).
– Resource management and scheduling: MPI: not specified; MapReduce: Workqueue (Google), HOD (Yahoo); Dryad: not clear.
– Low-level programming: MPI: MPI API; MapReduce: MapReduce API; Dryad: Dryad API.
– High-level programming: MPI: not specified; MapReduce: Pig, Hive, Jaql, etc.; Dryad: Scope, DryadLINQ.
– Data storage: MPI: the local file system, NFS, etc.; MapReduce: GFS (Google), HDFS (Hadoop), KFS, Amazon S3, etc.; Dryad: NTFS, Cosmos DFS.
– Task partitioning: MPI: the user manually partitions the tasks; MapReduce: automatic; Dryad: automatic.
– Communication: MPI: messaging, remote memory access; MapReduce: files (local FS, DFS); Dryad: files, TCP pipes, shared-memory FIFOs.
– Fault tolerance: MPI: checkpointing; MapReduce: task re-execution; Dryad: task re-execution.

531 Real-time vs offline analysis

According to timeliness requirements big data analysis canbe classified into real-time analysis and off-line analysis

ndash Real-time analysis is mainly used in E-commerce andfinance Since data constantly changes rapid data anal-ysis is needed and analytical results shall be returnedwith a very short delay The main existing architec-tures of real-time analysis include (i) parallel process-ing clusters using traditional relational databases and(ii) memory-based computing platforms For exampleGreenplum from EMC and HANA from SAP are bothreal-time analysis architectures

– Offline analysis is usually used for applications without stringent response-time requirements, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally imports logs into a special-purpose platform through data acquisition tools. Under the big data setting, many Internet enterprises use offline analysis architectures based on Hadoop to reduce the cost of data format conversion and to improve the efficiency of data acquisition. Examples include Facebook's open-source tool Scribe, LinkedIn's open-source tool Kafka, Taobao's open-source tool TimeTunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis; MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. Currently, mainstream BI products provide data analysis plans that support levels beyond TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open-source software. In this section, we briefly review the top five most widely used software tools, according to a survey of "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" of 798 professionals conducted by KDNuggets in 2012 [112].

– R (30.7 %): R, an open-source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, or Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. R is actually an implementation of the S language, an interpreted language developed by AT&T Bell Labs for data exploration, statistical analysis, and plotting. Compared to S, R is more popular since it is open source. R ranked first in the KDNuggets 2012 survey. Furthermore, in a 2012 survey on "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open-source software used for data mining, machine learning, and predictive analysis. In a 2011 KDnuggets investigation, it was used more frequently than R, ranking first that year. Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented through the connection of processes, which contain various operators. The entire flow can be regarded as a production line in a factory, with original data as input and model results as output; the operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open-source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly add new nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, i.e., all aspects of BI. Weka's data processing algorithms are also integrated into Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically involves large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of commercial applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and they require considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of network applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, a plethora of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of scientific applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operable analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have comprehensive coverage, we focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].
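As a small illustration of statistical anomaly detection on structured records (not tied to the specific models in [117, 118]), the sketch below flags measurements that deviate from the mean by more than three standard deviations; the threshold and the hypothetical energy readings are illustrative assumptions.

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Return indices of values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Hypothetical hourly energy readings with one obvious outlier at the end.
readings = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2, 10.1,
            9.9, 10.0, 10.2, 10.1, 9.8, 10.3, 10.0, 9.9, 10.1, 55.0]
print(zscore_anomalies(readings))  # -> [19]
```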

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-oriented potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. NLP-based techniques applied to text mining include information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
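To ground the text-representation step that most of these text-mining techniques build on, the following minimal sketch turns raw documents into bag-of-words term-frequency vectors; the tokenizer is deliberately naive and the sample documents are invented for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(documents):
    # One Counter (a sparse term-frequency vector) per document.
    return [Counter(tokenize(doc)) for doc in documents]

docs = [
    "Big data brings new opportunities for text mining.",
    "Text mining extracts knowledge from unstructured text.",
]
for vector in term_frequencies(docs):
    print(vector.most_common(3))
```

Real systems layer weighting (e.g., TF-IDF), NLP preprocessing, and the mining algorithms listed above on top of such representations.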

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text, research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes these models [127].
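As a rough illustration of how link-structure models rank pages, the sketch below runs a plain power-iteration version of PageRank over a tiny hand-made link graph; the damping factor of 0.85 and the toy graph are illustrative assumptions, not taken from [125].

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                 # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(sorted(pagerank(toy_graph).items(), key=lambda kv: -kv[1]))
```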

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly images, audio, and videos) has been growing at an amazing speed; useful knowledge is extracted and the semantics are understood from such data through analysis. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly involves further mining the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and the retrieval result is optimized through related feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of their interests and recommend other contents with similar features to users. These methods rely largely on content similarity measurement, but most of them are troubled by limited analysis and over-specification. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
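The following minimal sketch, under the assumption of a tiny hand-made user-item rating matrix, shows the user-based collaborative-filtering idea: find users with similar rating behavior (cosine similarity) and score unseen items by similarity-weighted sums. It illustrates the general technique, not the specific systems surveyed in [132, 133].

```python
import math

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(ratings, user, top_n=2):
    # Similarity of `user` to every other user, based on co-rated items.
    sims = {u: cosine(ratings[user], r) for u, r in ratings.items() if u != user}
    scores = {}
    for other, sim in sims.items():
        for item, rating in ratings[other].items():
            if item not in ratings[user]:               # score only unseen items
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "alice": {"video1": 5, "video2": 3, "song1": 4},
    "bob":   {"video1": 4, "song1": 5, "song2": 4},
    "carol": {"video2": 2, "song2": 5, "video3": 4},
}
print(recommend(ratings, "alice"))   # -> ['song2', 'video3']
```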

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for multimedia event detection using only a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities, while the content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has long been devoted to link prediction, community discovery, social network evolution, and social influence analysis, among other topics. An SNS may be visualized as a graph in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in an SNS [140]. Linear algebra approaches compute the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
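As a concrete, if simplistic, instance of link prediction, the sketch below scores unconnected vertex pairs by the Jaccard similarity of their neighbor sets; the toy friendship graph is invented for illustration, and the method stands in for the richer feature-based, probabilistic, and linear-algebra approaches in [139-141].

```python
def jaccard_link_scores(graph):
    """graph: dict mapping each vertex to the set of its neighbors (undirected)."""
    vertices = sorted(graph)
    scores = {}
    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            if v in graph[u]:
                continue                      # already linked, nothing to predict
            union = graph[u] | graph[v]
            if union:
                scores[(u, v)] = len(graph[u] & graph[v]) / len(union)
    return sorted(scores.items(), key=lambda kv: -kv[1])

friends = {
    "ann":  {"bob", "cat"},
    "bob":  {"ann", "cat", "dan"},
    "cat":  {"ann", "bob", "eve"},
    "dan":  {"bob"},
    "eve":  {"cat"},
}
print(jaccard_link_scores(friends)[:3])   # most likely future links first
```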

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on target functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144-146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case where individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals. Marketing, advertising, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of content in the SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, as does trivial Tweets on Twitter. Third, SNS are dynamic networks, which frequently and quickly vary and are updated. Existing research on social media analysis is still in its infancy. Considering that SNS contains massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. On the whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since research on mobile analysis is just beginning, we only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical location or on cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. By contrast, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, entertainment, etc.) gather together on a network, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, and physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize them.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to close the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about a 0.3 % bad-loan rate, which is greatly lower than that of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

The smart city is a hot research area based on the application of IoT data. For example, the smart city project built on the cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of the smart city brings benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social interactions, micro-blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Figure 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire values.

– Content-based applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands may be revealed.

– Structure-based applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws of social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.

Fig. 5 Enabling technologies for online social network-oriented big data

Generally speaking, the application of big data from online SNS may help us better understand user behavior and master the laws of social and economic activities from the following three aspects:

– Early warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

The Mount Sinai Medical Center in the US utilizestechnologies of Ayasdi a big data company to analyze allgenetic sequences of Escherichia Coli including over onemillion DNA variants to investigate why bacterial strainsresist antibiotics Ayasdirsquos uses topological data analysis abrand-new mathematic research method to understand datacharacteristics

HealthVault of Microsoft launched in 2007 is an excel-lent application of medical big data launched in 2007 Itsgoal is to manage individual health information in individualand family medical devices Presently health informationcan be entered and uploaded with mobile smart devices and

Fig 6 The correlation between Tweets about rice price and food price inflation

200 Mobile Netw Appl (2014) 19171ndash209

imported from individual medical records by a third-partyagency In addition it can be integrated with a third-partyapplication with the software development kit (SDK) andopen interface

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. Crowd sensing can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photo positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as its foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or would not want to accomplish. With no need for intentionally deployed sensing modules or employed professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests a service and resources related to a specified location. Then, mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will be more prevalent than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.

6.3.6 Smart grid

The smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-grid-related big data is generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). The smart grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the smart grid, regions with excessively high electrical load or high power outage frequencies can be identified, and even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at a given moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city: preferential transformation may be conducted on power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has had several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced, and, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal pricing sketch follows this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
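The following minimal sketch, referenced in the item on generation-consumption interaction above, illustrates how 15-minute smart-meter readings could be rolled up into a time-sharing (peak/off-peak) bill. The peak window and tariffs are invented assumptions for illustration, not TXU Energy's actual rates.

```python
from datetime import datetime

# Hypothetical tariffs in currency units per kWh.
PEAK_RATE, OFF_PEAK_RATE = 0.30, 0.12
PEAK_HOURS = range(17, 21)          # assume 17:00-20:59 is the peak window

def time_of_use_bill(readings):
    """readings: list of (timestamp, kwh) pairs from 15-minute smart-meter intervals."""
    bill = 0.0
    for timestamp, kwh in readings:
        rate = PEAK_RATE if timestamp.hour in PEAK_HOURS else OFF_PEAK_RATE
        bill += kwh * rate
    return round(bill, 2)

sample = [
    (datetime(2014, 6, 1, 8, 15), 0.4),   # off-peak interval
    (datetime(2014, 6, 1, 18, 0), 0.9),   # peak interval
    (datetime(2014, 6, 1, 18, 15), 0.8),  # peak interval
]
print(time_of_use_bill(sample))           # -> 0.56
```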

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there are many important problems that remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions for big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, among others. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise beyond traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; and (iii) data exhaust, which refers to wrong data collected during acquisition: in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being devoted to big data oriented databases and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and it is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction with big data are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, body properties, etc. of users may be more easily acquired, and users may not be aware of this. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is currently deemed the big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and the corresponding hardware and software system architectures. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross fusion of sciences: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security. Then, the impacts of big data on production management, business operation, and decision making shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly way may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

  – Compared with accurate data, we are willing to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – Simple algorithms on big data are more effective than complex algorithms on small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts."

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and progress in data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 116 Yuki N (2011) Following digital breadcrumbs to big data gold

httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)

94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special Report on Personal Technology (2011)

116 Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences


Page 15: Big Data: A Survey Min Chen

4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased; i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through the network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the perspective of system organization, DAS, NAS, and SAN can all be divided into three parts: (i) the disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connections among one or more disc arrays and servers; and (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As more servers are used, the probability of server failures becomes larger. Usually, data is divided into multiple pieces stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system were not seriously affected in satisfying customers' requests in terms of reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theorem [80, 81] in 2000, which indicates that a distributed system cannot simultaneously meet the requirements on consistency, availability, and partition tolerance: at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theorem in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system that ignores consistency, according to different design goals. The three kinds of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. AP systems are different from CP systems in that AP systems also ensure availability, but only ensure eventual consistency rather than the strong consistency of the previous two kinds of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
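
As an illustration of eventual consistency in AP systems, the sketch below uses a simplistic last-write-wins rule keyed on timestamps. It is only a toy model; real AP stores such as Dynamo and Cassandra use vector clocks or tunable quorums for reconciliation.

```python
# Toy illustration of eventual consistency: replicas diverge after a write
# and converge after a background anti-entropy synchronization.
import time

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}            # key -> (timestamp, value)

    def put(self, key, value):
        self.store[key] = (time.time(), value)

    def get(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas):
    """Merge replicas; the newest write for each key wins (last-write-wins)."""
    merged = {}
    for r in replicas:
        for key, (ts, value) in r.store.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    for r in replicas:
        r.store.update(merged)

if __name__ == "__main__":
    a, b = Replica("A"), Replica("B")
    a.put("profile:42", "v1")                                  # write lands on A only
    print("read from B before sync:", b.get("profile:42"))    # None (stale)
    anti_entropy([a, b])                                       # background synchronization
    print("read from B after sync:", b.get("profile:42"))     # "v1"
```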

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale distributed data-intensive applications [21]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet different demands for big data storage. For example, HDFS and Kosmosfs are derivatives of the open-source code of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model, and data is stored as key-value pairs. Every key is unique, and clients query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

  – Dynamo: Dynamo is a highly available and expandable distributed key-value storage system. It is used to store and manage the status of some core services that can be realized with key access in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through data partitioning, data replication, and object versioning mechanisms. Dynamo's partitioning scheme relies on consistent hashing [86] (a minimal sketch of this idea is given at the end of this key-value discussion), whose main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes, dividing the load across multiple main storage machines. Dynamo replicates data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

  – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are addressed by keys. Voldemort provides asynchronous updating and concurrent control of multiple versions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the update and any other operation, the update operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine; notably, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.
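
The partitioning idea behind Dynamo-style key-value stores, consistent hashing, can be sketched in a few lines. The sketch below is a minimal illustration without virtual nodes or replication, both of which real systems add.

```python
# Minimal consistent-hashing sketch: keys and nodes are hashed onto a ring,
# and each key is assigned to the first node clockwise from its position.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=()):
        self.ring = []                      # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self.ring, (self._hash(node), node))

    def remove_node(self, node):
        self.ring.remove((self._hash(node), node))

    def lookup(self, key):
        """Return the first node clockwise from the key's ring position."""
        if not self.ring:
            raise ValueError("empty ring")
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

if __name__ == "__main__":
    ring = ConsistentHashRing(["server-1", "server-2", "server-3"])
    keys = ["user:1", "user:2", "cart:99"]
    print({k: ring.lookup(k) for k in keys})
    ring.remove_node("server-2")            # only keys held by server-2 move
    print({k: ring.lookup(k) for k in keys})
```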

– Column-oriented databases: column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

  – BigTable: BigTable is a distributed structured data storage system, which is designed to process large-scale (PB-class) data among thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The indexes of the map are the row key, column key, and timestamp, and every value in the map is an uninterpreted byte array (a toy sketch of this data model is given at the end of this column-oriented discussion). Each row key in BigTable is a character string of up to 64KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short row range of data can be highly efficient, since it only involves communication with a small portion of machines. Columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers used to distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sequenced in descending order of timestamps, so the latest version is always read first.

The BigTable API features the creation and deletion of tables and column families, as well as modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing on a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: a master server, tablet servers, and a client library. BigTable only allows one master server to be active, which is responsible for distributing tablets to tablet servers, detecting added or removed tablet servers, and conducting load balancing. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, as well as deleted or disabled files used in specific BigTable instances. Every tablet server manages a tablet set and is responsible for the reading and writing of loaded tablets. When tablets grow too big, they are segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [21], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensuring there is at most one active master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up tablet servers; 4) conducting error recovery in case of tablet server failures; 5) storing BigTable schema information; and 6) storing the access control table.

  – Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open-source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column family, column, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are to be read or written, the operation on a row is atomic. Columns may constitute clusters, called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family at runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

  – Derivative tools of BigTable: since the BigTable code cannot be obtained under an open-source license, some open-source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java, and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partitioning and distribution are operated transparently, with room for client hashing or fixed keys.

HyperTable was developed similarly to BigTable, to obtain a high-performance, expandable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), and allows users to create, modify, and query the underlying tables.

Since column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrent control of multiple versions, while HBase and HyperTable focus on strong consistency through locks or log records.
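
To make the column-oriented data model above concrete, the following toy sketch models BigTable's sparse map from (row key, column family:qualifier, timestamp) to a value, returning the newest versions first. It is an in-memory illustration only, not the BigTable or HBase implementation.

```python
# Toy model of the BigTable data structure: a sparse map keyed by row,
# column ("family:qualifier"), and timestamp; reads return newest versions.
from collections import defaultdict

class ToyBigTable:
    def __init__(self):
        # row key -> column -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        versions = self.rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, max_versions=1):
        """Return up to max_versions values for one cell, newest first."""
        return [v for _, v in self.rows[row][column][:max_versions]]

if __name__ == "__main__":
    t = ToyBigTable()
    t.put("com.cnn.www", "anchor:my.look.ca", "CNN.com", timestamp=2)
    t.put("com.cnn.www", "anchor:my.look.ca", "CNN", timestamp=5)
    print(t.get("com.cnn.www", "anchor:my.look.ca"))                  # ['CNN']
    print(t.get("com.cnn.www", "anchor:my.look.ca", max_versions=2))  # ['CNN', 'CNN.com']
```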

– Document databases: compared with key-value stores, document stores can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

  – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB (a small sketch of this style of document store is given at the end of this document-database discussion). The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is executed with log files on the main nodes that record all the high-level operations conducted in the database. During replication, the slaves query all the writing operations since their last synchronization from the master and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


  – SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency, but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

  – CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its revision identifier is updated. CouchDB obtains scalability through optimistic replication without a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
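
The document model can be illustrated with a toy store that matches JSON-like documents on top-level fields, loosely in the style of a MongoDB query. Real document databases add indexes, BSON encoding, richer query operators, sharding, and replication.

```python
# Toy document store: documents are dicts with an "_id" primary key, and
# find() returns documents whose top-level fields equal the query values.
import json
import uuid

class ToyDocumentStore:
    def __init__(self):
        self.docs = {}                      # _id -> document (dict)

    def insert(self, doc):
        doc = dict(doc)
        doc.setdefault("_id", str(uuid.uuid4()))
        self.docs[doc["_id"]] = doc
        return doc["_id"]

    def find(self, query):
        """Return documents whose top-level fields match the query dict."""
        return [d for d in self.docs.values()
                if all(d.get(k) == v for k, v in query.items())]

if __name__ == "__main__":
    store = ToyDocumentStore()
    store.insert({"type": "user", "name": "alice", "age": 30})
    store.insert({"type": "user", "name": "bob", "age": 25})
    matches = store.find({"type": "user", "age": 30})
    print(json.dumps(matches, indent=2))
```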

Big data is generally stored in hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL systems and reduced the performance gap with relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set (a word-count sketch of this model is given after this item). MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
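
As a rough illustration of the programming model (not of Google's implementation), the following word-count sketch shows user-defined Map and Reduce functions in Python; the small driver stands in for the framework's shuffle stage, which groups intermediate pairs by key between the two phases.

from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for word in text.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: compress the list of values for one key into a single count."""
    yield (word, sum(counts))

# Toy driver standing in for the framework's shuffle/sort stage.
documents = {"d1": "big data survey", "d2": "big data analysis"}
groups = defaultdict(list)
for doc_id, text in documents.items():
    for word, one in map_fn(doc_id, text):
        groups[word].append(one)
for word, counts in groups.items():
    for key, total in reduce_fn(word, counts):
        print(key, total)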

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via the data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logical operation graph are automatically mapped to physical resources.

The operational structure of Dryad is coordinated by a central program called the job manager, which can be executed on a cluster or a workstation through the network. A job manager consists of two parts: 1) application code, which is used to build the job communication graph, and 2) program library code, which is used to arrange the available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment. A generic sketch of DAG-style execution follows this item.
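
Dryad's actual API is not reproduced here; purely as an illustration of the idea that a job is a directed acyclic graph whose vertexes are programs and whose edges are data channels, the following sketch topologically orders a small hypothetical graph and runs each vertex program on the outputs of its predecessors.

from graphlib import TopologicalSorter

# Hypothetical vertex programs: each consumes the outputs of its predecessors.
def read_logs(_):        return ["a b", "b c"]
def split_words(inputs): return [w for line in inputs[0] for w in line.split()]
def count(inputs):       return len(inputs[0])

vertex_program = {"read": read_logs, "split": split_words, "count": count}
edges = {"split": {"read"}, "count": {"split"}}   # edge = data channel (predecessors)

results = {}
for vertex in TopologicalSorter(edges).static_order():
    preds = sorted(edges.get(vertex, set()))
    results[vertex] = vertex_program[vertex]([results[p] for p in preds])
print(results["count"])   # 4 words in the toy input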

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A against all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmissions, which enables the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, which is generally a single file list in which all results are put in order. A minimal sketch of the core comparison appears below.
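
The core of the abstraction — applying a user-supplied function F to the Cartesian product of two sets and collecting the results in a matrix M — can be sketched as follows; the similarity function here is a hypothetical stand-in for a real biometric or bioinformatic comparison.

def all_pairs(set_a, set_b, f):
    """Return matrix M with M[i][j] = f(set_a[i], set_b[j])."""
    return [[f(a, b) for b in set_b] for a in set_a]

# Hypothetical comparison function: fraction of positions at which two strings agree.
def similarity(a, b):
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / length

m = all_pairs(["GATTACA", "GATTTCA"], ["GATTACA", "CATTACA"], similarity)
print(m)   # [[1.0, 0.857...], [0.857..., 0.714...]]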

– Pregel: The Pregel [104] system of Google facilitates the processing of large graphs, e.g., the analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge, associated with a source vertex, consists of a user-defined value and the identifier of the target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, with global synchronization points between them, until the algorithm completes and the output is finished. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express the given algorithm logic. Every vertex may modify its own status and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in an inactive status and no messages remain to be transmitted, the entire program execution is complete.

The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs. A vertex-centric sketch is given below.
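
The vertex-centric, superstep-based style can be illustrated with a toy computation that propagates the maximum vertex value through a small hypothetical graph; this mimics the model only, not Google's implementation.

# Toy graph: vertex -> outgoing neighbors; every vertex starts with its own value.
edges = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
value = {1: 3, 2: 6, 3: 2, 4: 1}

inbox = {v: [] for v in edges}      # messages delivered at the start of a superstep
active = set(edges)                 # all vertexes start active
superstep = 0

while active:
    outbox = {v: [] for v in edges}
    for v in active:
        # User-defined compute(): adopt the largest value seen so far.
        new_value = max([value[v]] + inbox[v])
        changed = new_value > value[v]
        value[v] = new_value
        if superstep == 0 or changed:
            for n in edges[v]:
                outbox[n].append(new_value)   # message along an outgoing edge
        # Otherwise the vertex votes to halt (sends nothing).
    # Barrier: deliver messages; a vertex with incoming mail is (re)activated.
    inbox = outbox
    active = {v for v, msgs in inbox.items() if msgs}
    superstep += 1

print(value)   # every vertex converges to the global maximum, 6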

Inspired by the above programming models, other studies have focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis is a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data (a brief k-means illustration is given after this list).

– Factor Analysis is basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor; the few factors are then used to reveal most of the information of the original data.



– Correlation Analysis is an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, some undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis is a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis can turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing, also called bucket testing, is a technology for determining how to improve target variables by comparing the tested groups. Big data requires a large number of tests to be executed and analyzed.

– Statistical Analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research.
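
As a brief illustration of the cluster analysis method mentioned above, the following sketch runs k-means on a tiny synthetic dataset using scikit-learn; the data and the choice of two clusters are purely illustrative, and any comparable clustering library could be substituted.

import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of 2-D points (e.g., customers described by two features).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))
data = np.vstack([group_a, group_b])

# Unsupervised grouping into two clusters; no training labels are used.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(model.cluster_centers_)              # roughly [0, 0] and [3, 3]
print(model.labels_[:5], model.labels_[-5:])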

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are shown as follows.

– Bloom Filter: a Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of data rather than the data itself, using a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages such as false positives and the lack of support for deletion (a minimal sketch appears after this list).

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when the data is updated.

– Trie: also called a trie tree, a variant of a hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons among character strings to the greatest extent, so as to improve query efficiency (a word-frequency sketch appears at the end of this subsection).

– Parallel Computing: compared to traditional serial computing, parallel computing refers to utilizing several computing resources simultaneously to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
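
The following is a minimal Bloom filter sketch, under the simplifying assumption that k independent hash functions can be derived by salting one cryptographic hash; a production filter would size the bit array and choose k from the target false-positive rate.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits            # the bit array (bitmap index)

    def _positions(self, item):
        # Derive k hash positions by salting one cryptographic hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # almost certainly False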

Although parallel computing systems or tools, such as MapReduce and Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
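
Returning to the trie structure listed above, the following minimal sketch shares common prefixes among inserted strings and keeps a count at each terminal node, which is one simple way to perform word frequency statistics; the tiny word list is illustrative only.

class TrieNode:
    def __init__(self):
        self.children = {}   # one child per next character; prefixes are shared
        self.count = 0       # frequency of the word ending at this node

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def frequency(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return 0
        node = node.children[ch]
    return node.count

root = TrieNode()
for w in ["data", "data", "database", "datum"]:
    insert(root, w)
print(frequency(root, "data"), frequency(root, "datum"))   # 2 1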

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures should be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment: MPI – computing nodes and data storage arranged separately (data must be moved to the computing nodes); MapReduce – computing and data storage arranged at the same node (computing should be close to the data); Dryad – computing and data storage arranged at the same node (computing should be close to the data).

Resource management/scheduling: MPI – none; MapReduce – Workqueue (Google), HOD (Yahoo); Dryad – not clear.

Low-level programming: MPI – MPI API; MapReduce – MapReduce API; Dryad – Dryad API.

High-level programming: MPI – none; MapReduce – Pig, Hive, Jaql, etc.; Dryad – Scope, DryadLINQ.

Data storage: MPI – the local file system, NFS, etc.; MapReduce – GFS (Google), HDFS (Hadoop), Amazon S3, etc.; Dryad – NTFS, KFS, Cosmos DFS.

Task partitioning: MPI – users manually partition the tasks; MapReduce – automatic; Dryad – automatic.

Communication: MPI – messaging, remote memory access; MapReduce – files (local FS, DFS); Dryad – files, TCP pipes, shared-memory FIFOs.

Fault tolerance: MPI – checkpointing; MapReduce – task re-execution; Dryad – task re-execution.

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory, so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Currently, mainstream BI products provide data analysis plans that support levels over TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of 798 professionals conducted by KDnuggets in 2012 on "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey on "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers, such as Teradata and Oracle, have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was more frequently used than R and ranked first. Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. The functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and rich open-source data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably larger capacities for supporting location-aware, people-oriented, and context-aware operations.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, a wealth of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operable analysis software, and data services to assist researchers, educators, and students in enriching the plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar underlying technologies. Since data analysis has a broad scope and it is not easy to achieve comprehensive coverage, we focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercial technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-oriented potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case that utilizes such models [127]. A simplified PageRank iteration is sketched below.
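
As a rough illustration of link-structure-based ranking of the kind PageRank performs, the following sketch iterates the standard power-method update on a tiny hypothetical link graph with a damping factor of 0.85; real systems operate on billions of pages with sparse-matrix machinery.

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = list(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}
damping = 0.85

for _ in range(50):  # fixed number of power iterations for simplicity
    new_rank = {p: (1.0 - damping) / n for p in pages}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += damping * share
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))  # "c", with the most inlinks, ranks highest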

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data has increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; analysis is used to extract useful knowledge and understand the semantics of such data. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. According to the result of the structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos; the retrieval result is then refined through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify the general features of users or their interests and recommend other contents with similar features to them. These methods rely largely on content similarity measurements, but most of them suffer from limited analysis and over-specification. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graphical structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers to predict the future link [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph, in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142]. A simple link-prediction sketch is given below.
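
One of the simplest feature-based signals for link prediction is the number of common neighbors of two currently unconnected vertexes; the sketch below scores candidate pairs on a small hypothetical friendship graph, producing the kind of feature a binary classifier could consume.

from itertools import combinations

# Hypothetical undirected friendship graph: user -> set of friends.
graph = {
    "ann":  {"bob", "cara"},
    "bob":  {"ann", "cara", "dave"},
    "cara": {"ann", "bob", "eve"},
    "dave": {"bob"},
    "eve":  {"cara"},
}

def common_neighbors(u, v):
    return len(graph[u] & graph[v])

# Score every pair that is not yet linked; higher scores suggest likelier future links.
candidates = [
    (u, v, common_neighbors(u, v))
    for u, v in combinations(sorted(graph), 2)
    if v not in graph[u]
]
for u, v, score in sorted(candidates, key=lambda t: -t[2]):
    print(u, v, score)
# Pairs sharing a neighbor rank above ("dave", "eve"), which shares none.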

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on target functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial Tweets. Third, SNS are dynamic networks, which vary frequently and quickly and are constantly updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is still in its early stages, we only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, entertainment, etc.) gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjøvik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, regarding marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. Regarding sales planning, after comparing massive data, enterprises can optimize their commodity prices. Regarding operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. Regarding the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the ages, genders, addresses, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises, using the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than that of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project, a cooperation between Miami-Dade County in Florida and IBM, closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision-making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year, due to timely identification and repair of water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network. Big data from online SNS mainly comes from instant messages, online social networking, micro blogs, and shared space, etc., and represents various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data from online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final results into an extreme personalized treatment plan to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their blood sugar content is too high.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operational framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.

6.3.6 Smart grid

Smart grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for the optimized generation, supply, and consumption of electric energy. Smart-grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may then be conducted.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has made several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price


according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a small pricing sketch follows this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generations.
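The following minimal sketch illustrates the time-sharing dynamic pricing idea mentioned above: 15-minute smart-meter readings are billed at a higher rate during an assumed peak window. The rates and peak hours are made-up values, not TXU Energy's actual tariff.

# A minimal sketch of time-sharing dynamic pricing over 15-minute smart-meter
# readings. Peak window and per-kWh rates are assumptions for illustration.

PEAK_HOURS = range(17, 21)             # assumed peak: 17:00-20:59
PEAK_RATE, OFF_PEAK_RATE = 0.25, 0.10  # assumed prices per kWh

def bill(readings):
    """readings: list of (hour_of_day, kWh consumed in that 15-minute slot)."""
    total = 0.0
    for hour, kwh in readings:
        rate = PEAK_RATE if hour in PEAK_HOURS else OFF_PEAK_RATE
        total += kwh * rate
    return round(total, 2)

# Example: four 15-minute readings, two of them inside the peak window.
readings = [(8, 0.3), (12, 0.4), (18, 0.6), (19, 0.5)]
print(bill(readings))  # prints the total charge across peak and off-peak rates

Charging more during peak hours gives users an incentive to shift consumption, which is exactly the peak-shaving effect described above.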

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after the system is implemented and deployed, which makes it difficult to compare the advantages and disadvantages of alternative solutions horizontally, before or after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which refers to incorrect data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of

the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is currently deemed a big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data from the public pages of Facebook users who had failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data


quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated (a minimal sketch of such automated checks follows this list).

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data. A small log-analysis sketch also follows this list.
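As a concrete illustration of the data quality concern above, the following minimal sketch runs simple automated checks for completeness, redundancy, and a basic consistency rule over a list of records; the field names and rules are assumptions for illustration only.

# A minimal sketch of automated data-quality checks: completeness (missing
# fields), redundancy (duplicate records), and consistency (non-negative
# consumption). Field names such as "meter_id" and "kwh" are assumed.

REQUIRED_FIELDS = ["meter_id", "timestamp", "kwh"]

def quality_report(records):
    seen, duplicates, incomplete, inconsistent = set(), 0, 0, 0
    for rec in records:
        # completeness: all required fields present and non-empty
        if any(rec.get(f) in (None, "") for f in REQUIRED_FIELDS):
            incomplete += 1
        # redundancy: identical (meter_id, timestamp) pairs count as duplicates
        key = (rec.get("meter_id"), rec.get("timestamp"))
        if key in seen:
            duplicates += 1
        seen.add(key)
        # consistency: consumption must be non-negative
        if isinstance(rec.get("kwh"), (int, float)) and rec["kwh"] < 0:
            inconsistent += 1
    return {"records": len(records), "incomplete": incomplete,
            "duplicates": duplicates, "inconsistent": inconsistent}

records = [
    {"meter_id": "m1", "timestamp": "2014-01-01T00:00", "kwh": 0.4},
    {"meter_id": "m1", "timestamp": "2014-01-01T00:00", "kwh": 0.4},  # duplicate
    {"meter_id": "m2", "timestamp": "", "kwh": -1.0},   # incomplete and inconsistent
]
print(quality_report(records))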
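Likewise, for the information-security application above, the following minimal sketch scans intrusion-detection log lines and flags sources that generate repeated failed logins inside a short window; the log format, event names, and threshold are assumptions, not those of any particular IDS product.

# A minimal sketch of mining IDS-style log lines for suspicious behavior:
# sources with THRESHOLD or more failed logins within WINDOW are flagged.

from collections import defaultdict
from datetime import datetime, timedelta

THRESHOLD = 3                 # assumed: 3+ failures within the window
WINDOW = timedelta(minutes=5)

def flag_suspicious(log_lines):
    """log_lines: iterable of 'timestamp,source_ip,event' CSV lines."""
    failures = defaultdict(list)
    for line in log_lines:
        ts, ip, event = line.strip().split(",")
        if event == "LOGIN_FAIL":
            failures[ip].append(datetime.fromisoformat(ts))
    suspicious = set()
    for ip, times in failures.items():
        times.sort()
        for i in range(len(times)):
            # count failures falling inside a sliding window starting here
            in_window = [t for t in times[i:] if t - times[i] <= WINDOW]
            if len(in_window) >= THRESHOLD:
                suspicious.add(ip)
                break
    return suspicious

logs = [
    "2014-01-01T10:00:00,10.0.0.5,LOGIN_FAIL",
    "2014-01-01T10:01:00,10.0.0.5,LOGIN_FAIL",
    "2014-01-01T10:02:00,10.0.0.5,LOGIN_FAIL",
    "2014-01-01T10:02:30,10.0.0.9,LOGIN_OK",
]
print(flag_suspicious(logs))  # {'10.0.0.5'}

At big data scale, the same counting logic would typically run as a distributed streaming or MapReduce job rather than a single loop.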

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We could

not predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, Google's globally-distributed database, and F1, a fault-tolerant and expandable distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will be

of increasing concern, and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/big-data-0

5 Lohr S (2012) The age of big data. New York Times, pp 1
6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data
7 Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9 Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc
91 Judd D (2008) hypertable-0.9.0.4-alpha
92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc
93 Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)
94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc
95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc
96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special Report on Personal Technology (2011)
116 Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP could not ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that AP systems also ensure availability, but only guarantee eventual consistency rather than the strong consistency of the previous two kinds of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source code of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support of large-volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main kinds of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: Key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and the modern key-value databases are characterized by high expandability and shorter query response time than those of relational databases. Over the past few years, many key-value databases have appeared, as motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object versioning mechanisms. The Dynamo partition plan relies on Consistent Hashing [86], whose main advantage is that the departure or arrival of a node only affects its directly adjacent nodes and no others, to divide the load over multiple main storage machines. Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve


high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are addressed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the updating and any other operation, the updating operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery so that backup is not needed. Minimal sketches of consistent hashing and of the basic key-value interface are given below.
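The first sketch illustrates the Consistent Hashing idea behind Dynamo-style partitioning: keys and nodes are hashed onto the same ring, and each key is stored on the first node found clockwise plus the following nodes as replicas. It is a toy illustration of the general technique, not Dynamo's actual implementation; node names and the hash choice are assumptions.

# A minimal consistent-hashing sketch: keys and nodes share one hash ring;
# a key's preference list is the next `replicas` distinct nodes clockwise.

import bisect
import hashlib

def ring_hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def preference_list(self, key):
        """Return the nodes responsible for the key, walking the ring clockwise."""
        positions = [h for h, _ in self.ring]
        start = bisect.bisect(positions, ring_hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(min(self.replicas, len(self.ring)))]

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"], replicas=3)
print(ring.preference_list("user:42"))  # three nodes that would hold copies of the key

Adding or removing a node only changes ownership of the keys adjacent to it on the ring, which is exactly the locality property noted above.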
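The second sketch shows the simple get/put/delete interface shared by such key-value stores, with a conditional, optimistic-locking style put that fails on a version conflict, in the spirit of the Voldemort behavior described above. Class and method names are illustrative assumptions, not any store's real API.

# A minimal key-value interface with optimistic concurrency control:
# a write succeeds only if the caller's expected version matches the stored one.

class KeyValueStore:
    def __init__(self):
        self._data = {}   # key -> (version, value)

    def get(self, key):
        """Return (version, value), or None if the key is absent."""
        return self._data.get(key)

    def put(self, key, value, expected_version=None):
        """Conditional write: returns False on a version conflict."""
        current = self._data.get(key)
        current_version = current[0] if current else 0
        if expected_version is not None and expected_version != current_version:
            return False                      # another writer got there first
        self._data[key] = (current_version + 1, value)
        return True

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("cart:7", ["book"])                                    # version becomes 1
ok = store.put("cart:7", ["book", "pen"], expected_version=1)    # succeeds
stale = store.put("cart:7", ["dvd"], expected_version=1)         # conflict
print(ok, stale)  # True False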

– Column-oriented databases: The column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented over multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional sequenced mapping with sparse, distributed, and persistent storage. Indexes of the mapping are the row key, column key, and timestamp, and every value in the mapping is an uninterpreted byte array. Each row key in BigTable

is a 64 KB character string. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers to distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in the descending order of timestamps, so the latest edition will always be read first.

The BigTable API features the creation and deletion of Tablets and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from individual columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: the Master server, Tablet servers, and the client library. Only one Master server is allowed to be active at a time; it is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, as well as deleted or disabled files, and use them in specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of the loaded Tablets. When Tablets grow too big, they will be segmented by the server. The client library is used by applications to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine status. The SSTable file format is used to store BigTable data internally, and it provides a mapping between persistent, sequenced, and unchangeable keys and values, both as arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; and 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system that manages huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, whose four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may constitute clusters, called column families, which are similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: Since the BigTable code cannot be obtained under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly flushes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partitioning and distribution are transparently operated and have space for client hash or fixed keys.

HyperTable was developed similarly to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.

– Document Database: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files in the main nodes that record all the high-level operations conducted in the database. During copying, the slave nodes query all the writing operations since the last synchronization from the master and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.
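A brief illustration of these query features (assuming a locally running MongoDB instance and the pymongo driver; the database, collection, and field names are invented for the example):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
db = client["shop"]                                  # hypothetical database
orders = db["orders"]                                # hypothetical collection

# Documents are stored as BSON; an "_id" primary key is added automatically.
orders.insert_one({"customer": {"name": "Alice", "city": "Wuhan"},
                   "items": ["book", "pen"],
                   "total": 25.5})

# Index a queryable field to speed up lookups.
orders.create_index([("customer.city", ASCENDING)])

# JSON-like query syntax, including a condition on an embedded object.
for doc in orders.find({"customer.city": "Wuhan", "total": {"$gt": 10}}):
    print(doc["_id"], doc["total"])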


– SimpleDB: SimpleDB is a distributed database provided as a web service by Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume grows. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB ensures only eventual consistency and does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. Every time a document is rewritten, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on its copying mechanism. CouchDB supports MVCC with historical hash records.
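As a rough illustration of this access pattern (assuming a CouchDB server listening on localhost:5984 and the third-party requests library; the database and document names are invented), a client round-trip might look like the following sketch.

import requests

BASE = "http://localhost:5984"          # assumed local CouchDB server
db, doc_id = "articles", "survey-001"   # hypothetical database and document id

requests.put(f"{BASE}/{db}")            # create the database (no-op if it already exists)

# Create a document; CouchDB stores it as JSON and assigns a revision (_rev).
requests.put(f"{BASE}/{db}/{doc_id}",
             json={"title": "Big Data: A Survey", "tags": ["survey", "big data"]})

# To modify a document, download it entirely, edit it, and write it back.
# The fetched document carries its current _rev; a stale revision is rejected (MVCC).
doc = requests.get(f"{BASE}/{db}/{doc_id}").json()
doc["tags"].append("NoSQL")
requests.put(f"{BASE}/{db}/{doc_id}", json=doc)

print(requests.get(f"{BASE}/{db}/{doc_id}").json())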

Big data is generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some newly proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, which uses a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, but this has been mitigated by some recent enhancements [96, 97].
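For illustration, the classic word-count job expressed with the two user-defined functions might look as follows; this is a plain single-machine simulation of the model in Python, not the distributed MapReduce runtime.

from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # emit an intermediate (word, 1) pair for every word in the input record
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # compress the value set for one key into a single aggregate
    yield word, sum(counts)

def run_local_mapreduce(records, map_fn, reduce_fn):
    intermediate = [kv for key, val in records for kv in map_fn(key, val)]
    intermediate.sort(key=itemgetter(0))                 # "shuffle": group by key
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, (v for _, v in group)))
    return results

lines = ["big data is big", "data analysis of big data"]
print(run_local_mapreduce(enumerate(lines), map_fn, reduce_fn))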

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, the resources in a logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in the cluster or at a workstation accessed through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output datasets, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A against all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, which lets the workload of every partition retrieve its input data effectively. In Phase III, after the data is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire the data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, which is generally a single file list in which all results are put in order.
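Conceptually, and ignoring the distributed phases described above, the abstraction reduces to computing a Cartesian comparison matrix; in this toy Python sketch the Jaccard similarity merely stands in for whatever comparison function F an application supplies.

def all_pairs(set_a, set_b, f):
    # output matrix M with M[i][j] = F(A[i], B[j]); the real system
    # partitions this grid across a cluster instead of one machine
    return [[f(a, b) for b in set_b] for a in set_a]

def jaccard(x, y):
    # stand-in comparison function over character sets
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 1.0

A = ["GATTACA", "CATCAT"]
B = ["TACCAGA", "GAGAGA", "CATCAT"]
M = all_pairs(A, B, jaccard)   # 2 x 3 comparison matrix
print(M)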

– Pregel: The Pregel [104] system of Google facilitates the processing of large graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge, associated with a source vertex, is constituted by a user-defined value and the identifier of its target vertex. After the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is finished. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and the status of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in an inactive status without any messages to transmit, the entire program execution is completed.

The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and the output of a Pregel program are isomorphic directed graphs.
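To illustrate only the vertex-centric superstep model (a toy single-machine simulation, not Google's Pregel implementation), a single-source shortest-path computation could be sketched as follows; graph data and function names are invented for the example.

INF = float("inf")

def pregel_sssp(edges, source, max_supersteps=30):
    """Toy synchronous vertex-centric loop: every vertex runs the same
    compute function on its inbox, sends messages along out-edges, then halts."""
    vertices = {v for e in edges for v in e[:2]}
    out_edges = {v: [] for v in vertices}
    for u, v, w in edges:
        out_edges[u].append((v, w))

    value = {v: INF for v in vertices}              # per-vertex user-defined value
    inbox = {v: ([0] if v == source else []) for v in vertices}

    for _ in range(max_supersteps):                 # one iteration = one superstep
        outbox = {v: [] for v in vertices}
        active = False
        for v in vertices:                          # conceptually parallel
            if not inbox[v]:
                continue                            # no messages: vertex stays halted
            best = min(inbox[v])
            if best < value[v]:
                value[v] = best
                for target, w in out_edges[v]:
                    outbox[target].append(best + w)
                active = True
        inbox = outbox                              # global synchronization point
        if not active:                              # all vertexes voted to halt
            break
    return value

edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
print(pregel_sssp(edges, "a"))   # shortest distances: a=0, b=1, c=3, d=4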

Inspired by the above programming models, other studies have also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a major guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that does not require training data.

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into a factor, and then using the few factors to reveal most of the information of the original data.



– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, some undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis can turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing, a technique for determining how to improve target variables by comparing a tested group against a control group. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide both a description of and an inference about big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
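As a concrete instance of one algorithm from this list, a plain (non-distributed) k-means sketch in Python might look as follows; the toy 2-D points, the fixed iteration count, and the random initialization are illustrative assumptions.

import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                            (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster became empty
                centroids[i] = (sum(x for x, _ in c) / len(c),
                                sum(y for _, y in c) / len(c))
    return centroids

data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(data, k=2))   # two centroids, one near (1, 1) and one near (8, 8)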

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: a Bloom filter consists of a bit array and a series of hash functions. The principle of the Bloom filter is to store the hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages in misrecognition (false positives) and deletion.
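A minimal Bloom filter sketch in Python (the bit-array size, the number of hash functions, and the SHA-1-based hashing are illustrative choices, not a recommendation):

import hashlib

class BloomFilter:
    """Bit array plus k hash functions; membership queries may return
    false positives but never false negatives, and items cannot be deleted."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # derive k bit positions from salted hashes of the item
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("10.0.0.1")
print("10.0.0.1" in bf, "10.0.0.2" in bf)   # True, (almost certainly) False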

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which have to be maintained dynamically when the data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons among character strings to the greatest extent, so as to improve query efficiency.
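A minimal trie sketch in Python for word-frequency counting (the class and method names are invented for the example); words with a common prefix share the same path of nodes, so no full string comparisons are needed.

class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}   # next character -> child node
        self.count = 0       # how many times a word ends here

class Trie:
    """Prefix tree for rapid retrieval and word-frequency statistics."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def frequency(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count

t = Trie()
for w in ["data", "data", "database", "datum"]:
    t.insert(w)
print(t.frequency("data"), t.frequency("datum"))   # 2 1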

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages have been developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Property | MPI | MapReduce | Dryad
Deployment | Computing nodes and data storage arranged separately (data should be moved to computing nodes) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data)
Resource management / scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear
Low-level programming | MPI API | MapReduce API | Dryad API
High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ
Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS
Task partitioning | User manually partitions the tasks | Automatic | Automatic
Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs
Fault tolerance | Checkpoint | Task re-execution | Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results must be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without stringent requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a dedicated platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. Currently, mainstream BI products provide data analysis plans that support data beyond the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of 798 professionals, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?", conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. R is actually a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and plotting. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are installed as well, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software package used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was more frequently used than R (ranking first that year). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes composed of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software package written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, a plentiful set of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operable analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case that utilizes such models [127].

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; useful knowledge can be extracted and the semantics understood through analysis. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference: although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, and which is the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos; the retrieval result is then optimized through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend other contents with similar features to the users. These methods largely rely on content similarity measurement, but most of them are troubled by limited analysis capability and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
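As a minimal illustration of the collaborative-filtering idea only (user-based, cosine similarity over an invented toy rating matrix; not any particular system's method):

from math import sqrt

ratings = {  # hypothetical user -> {video id -> rating}
    "u1": {"v1": 5, "v2": 4, "v3": 1},
    "u2": {"v1": 4, "v2": 5},
    "u3": {"v3": 5, "v4": 4},
}

def cosine(a, b):
    # similarity of two users based on their co-rated items
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def recommend(user, k=2):
    # score unseen items by similarity-weighted ratings of the other users
    seen = ratings[user]
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, their)
        for item, r in their.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u1"))   # items u1 has not rated, ranked by predicted interest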

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and example videos [134]. In [135], the authors proposed a new algorithm for multimedia event detection using only a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graphic structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has long been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].

Many methods for community detection have been proposed and studied, most of which are topology-based, relying on target functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case where individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial Tweets. Third, SNS are dynamic networks, which frequently and quickly vary and are updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just getting started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and such communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) gather together on a network, meet to make a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information for unlocking a security system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with the ages, genders, addresses, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project based on the cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision-making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year, due to timely identification and fixing of water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, microblogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. The community-based analysis is of vital importance to improve information propagation and for interpersonal relation analysis.

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about the rice price on Twitter, as shown in Fig. 6.
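Aspect 1) above is essentially anomaly detection on topic volume. The sketch below flags days whose message counts rise or fall sharply relative to a moving baseline; the counts, window, and z-score threshold are illustrative assumptions rather than Global Pulse's actual methodology.

```python
from statistics import mean, stdev

def volume_spikes(daily_counts, window=7, threshold=3.0):
    """Return (day, count, z) for days whose count deviates sharply from the
    mean and standard deviation of the preceding `window` days."""
    anomalies = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue
        z = (daily_counts[i] - mu) / sigma
        if abs(z) >= threshold:
            anomalies.append((i, daily_counts[i], round(z, 1)))
    return anomalies

if __name__ == "__main__":
    # Hypothetical daily counts of tweets mentioning "rice price"
    counts = [120, 130, 125, 118, 140, 135, 128, 131, 122, 460, 455, 138]
    print(volume_spikes(counts))  # day 9 stands out as a spike
```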

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing, complex data containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment, in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photography, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows: a user requests services and resources related to a specified location; then mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
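A minimal sketch of the task-assignment step in this framework follows: a location-tagged request is matched to the closest willing worker. The worker list and the nearest-worker policy are assumptions made for illustration; deployed spatial crowdsourcing platforms use far richer assignment strategies (incentives, deadlines, worker reputation).

```python
import math

# Hypothetical workers who have agreed to take spatial tasks: (id, lat, lon)
WORKERS = [("w1", 30.52, 114.31), ("w2", 30.60, 114.27), ("w3", 30.47, 114.40)]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def assign_task(task_lat, task_lon, workers=WORKERS):
    """Pick the willing worker closest to the requested location."""
    wid, lat, lon = min(workers, key=lambda w: haversine_km(task_lat, task_lon, w[1], w[2]))
    return wid, round(haversine_km(task_lat, task_lon, lat, lon), 2)

if __name__ == "__main__":
    # A requester asks for a photo to be taken at (30.50, 114.35)
    print(assign_task(30.50, 114.35))  # -> ('w1', ...), the nearest willing worker
```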

6.3.6 Smart grid

Smart grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-grid-related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nation-wide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the smart grid, regions with excessively high electrical load or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal pricing sketch is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages, and such energy resources can complement the traditional hydropower and thermal power generation.
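The sketch below illustrates the time-sharing dynamic pricing mentioned in the second item: hourly system load (aggregated from 15-minute meter readings) is split into peak and off-peak hours, and peak hours are billed at a higher rate. The load values, prices, and the quantile rule are assumptions, not TXU Energy's tariff.

```python
# Hypothetical hourly system load in MW, derived from 15-minute smart-meter data
hourly_load = [310, 295, 290, 300, 330, 380, 450, 520, 560, 540, 500, 470,
               460, 455, 470, 500, 560, 610, 640, 600, 520, 430, 370, 330]

OFF_PEAK_PRICE = 0.08   # USD per kWh, assumed
PEAK_PRICE = 0.15       # USD per kWh, assumed

def build_tariff(load, peak_share=0.25):
    """Mark the top `peak_share` of hours as peak-priced, the rest off-peak."""
    threshold = sorted(load)[int(len(load) * (1 - peak_share))]
    return [PEAK_PRICE if l >= threshold else OFF_PEAK_PRICE for l in load]

if __name__ == "__main__":
    tariff = build_tariff(hourly_load)
    peak_hours = [h for h, p in enumerate(tariff) if p == PEAK_PRICE]
    print("peak-priced hours:", peak_hours)  # evening and morning load peaks
```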

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, online social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of the display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: These include the memory mode, data flow mode, PRAM mode, and MapReduce (MR) mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon (a minimal MapReduce-style sketch is given after this list).
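To make the MR mode concrete, the sketch referenced in the last item expresses the canonical word count as user-defined map and reduce functions over key-value pairs. It is a single-process illustration of the programming model, not a distributed implementation.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    """Shuffle then reduce: sum the counts emitted for each distinct key."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

if __name__ == "__main__":
    splits = ["big data needs new computing models",
              "data transfer limits big data computing"]
    pairs = [kv for doc in splits for kv in map_phase(doc)]
    print(reduce_phase(pairs))  # e.g. {'big': 2, 'data': 3, ...}
```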

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck of big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data (one simple depreciation model is sketched after this list).

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which refers to the wrong data collected during acquisition; in big data, not only the correct data but also the wrong data should be utilized to generate more value.
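One simple way to model the rate of depreciation mentioned in the real-time performance item is exponential time decay, where older records contribute less to a real-time aggregate. The half-life value and the weighting scheme below are assumptions for illustration only.

```python
import math

def decayed_average(samples, now, half_life_s=3600.0):
    """Weight each (timestamp, value) sample by 2^(-age/half_life) and average.

    Recent samples dominate; a sample one half-life old counts half as much.
    """
    num = den = 0.0
    for ts, value in samples:
        w = math.pow(2.0, -(now - ts) / half_life_s)
        num += w * value
        den += w
    return num / den if den else float("nan")

if __name__ == "__main__":
    now = 10_000.0
    samples = [(now - 7200, 40.0), (now - 3600, 50.0), (now - 60, 80.0)]
    print(round(decayed_average(samples, now), 1))  # the recent reading pulls the average up
```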

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems still need to be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated (a small metrics sketch is given after this list).

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential security loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
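As a small illustration of the automatic data quality detection called for in the data quality item, the sketch below computes completeness, redundancy, and a simple consistency metric over tabular records; the field names and the validity rule are assumptions chosen for the example.

```python
def quality_report(records, required=("id", "timestamp", "value")):
    """Compute toy data-quality metrics over a list of dict records."""
    total = len(records)
    complete = sum(all(r.get(f) is not None for f in required) for r in records)
    distinct_ids = len({r.get("id") for r in records})
    consistent = sum(isinstance(r.get("value"), (int, float)) and r["value"] >= 0
                     for r in records)
    return {
        "completeness": complete / total,        # share of records with all required fields
        "redundancy": 1 - distinct_ids / total,  # share of duplicated ids
        "consistency": consistent / total,       # share of records passing the value rule
    }

if __name__ == "__main__":
    sample = [
        {"id": 1, "timestamp": 100, "value": 3.2},
        {"id": 1, "timestamp": 101, "value": 3.3},   # duplicate id
        {"id": 2, "timestamp": None, "value": -5},   # missing field, invalid value
    ]
    print(quality_report(sample))
```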

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, and completeness maintenance, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions against possible events that will occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers are concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields requires the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have a far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and will of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and progress in data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 116 Yuki N (2011) Following digital breadcrumbs to big data gold

httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase the definitive guide OrsquoReilly Media Inc91 Judd D (2008) hypertable-09 04-alpha92 Chodorow K (2013) MongoDB the definitive guide OrsquoReilly

Media Inc93 Crockford D (2006) The applicationjson media type for

javascript object notation (json)94 Murty J (2009) Programming amazon web services S3 EC2

SQS FPS and SimpleDB OrsquoReilly Media Inc95 Anderson JC Lehnardt J Slater N (2010) CouchDB the defini-

tive guide OrsquoReilly Media Inc96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y

(2010) A comparison of join algorithms for log processing inmapreduce In Proceedings of the 2010 ACM SIGMOD inter-national conference on management of data ACM pp 975ndash986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


Page 17: Big Data: A Survey Min Chen

high availability and durability. The Dynamo system also provides eventual consistency so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations, reading, writing, and deletion, all of which are confirmed by keys. Voldemort provides asynchronous updating and concurrent control of multiple editions, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.
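To make this style of interface concrete, the following is a minimal, purely illustrative sketch of a versioned key-value store with a read/write/delete interface and optimistic locking; the class and method names are invented for this example and do not reflect Voldemort's actual API.

```python
# Illustrative sketch (not Voldemort's real API) of a versioned key-value
# store with the three operations described above. Each write bumps a
# version counter, emulating optimistic locking: a write carrying a stale
# expected_version is rejected.

class SimpleVersionedKVStore:
    def __init__(self):
        self._data = {}          # key -> (version, value)

    def read(self, key):
        """Return (version, value), or None if the key is absent."""
        return self._data.get(key)

    def write(self, key, value, expected_version=None):
        """Write a value; fail if another writer updated the key first."""
        current = self._data.get(key)
        current_version = current[0] if current else 0
        if expected_version is not None and expected_version != current_version:
            raise ValueError("stale version: concurrent update detected")
        self._data[key] = (current_version + 1, value)
        return current_version + 1

    def delete(self, key):
        self._data.pop(key, None)

store = SimpleVersionedKVStore()
v = store.write("user:42", {"name": "Alice"})
store.write("user:42", {"name": "Alice B."}, expected_version=v)
```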

The key-value database emerged a few years ago. Deeply influenced by Amazon Dynamo DB, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys over nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copying and recovery to avoid the need for backup.

– Column-oriented Database: The column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional sequenced mapping with sparse, distributed, and persistent storage. The indexes of the mapping are row key, column key, and timestamp, and every value in the mapping is an unanalyzed byte array. Each row key in BigTable is a 64 KB character string. Rows are stored in lexicographical order and are continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers used to distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in descending order of timestamps, so the latest edition is always read first.
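The sparse, multi-dimensional mapping described above can be mimicked with a small, hedged sketch (illustrative only, not the real BigTable interface), where each cell is addressed by (row key, column key, timestamp) and editions are kept in descending timestamp order:

```python
import time
from collections import defaultdict

# Illustrative sketch of BigTable's data model: a sparse map
# (row_key, column_key, timestamp) -> value, with the newest
# edition of a cell returned first.

class TinyBigTableModel:
    def __init__(self):
        # row_key -> column_key -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column_key, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        cell = self._rows[row_key][column_key]
        cell.append((ts, value))
        cell.sort(key=lambda tv: tv[0], reverse=True)   # descending timestamps

    def get(self, row_key, column_key, n_versions=1):
        """Return up to n_versions of a cell, newest edition first."""
        return self._rows[row_key][column_key][:n_versions]

t = TinyBigTableModel()
t.put("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=1)
t.put("com.cnn.www", "anchor:cnnsi.com", "CNN Sports Illustrated", timestamp=2)
print(t.get("com.cnn.www", "anchor:cnnsi.com"))   # newest edition
```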

The BigTable API features the creation and deletion of Tablets and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every procedure executed by BigTable includes three main components: the Master server, Tablet servers, and the client library. BigTable only allows one Master server to be distributed, which is responsible for distributing Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balance. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage in GFS, i.e., files that have been deleted or disabled and were used by specific BigTable instances. Every Tablet server manages a Tablet set and is responsible for the reading and writing of the loaded Tablets. When Tablets grow too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is built on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a mapping between persistent, sequenced, and unchangeable keys and values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following server-side tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system for managing the huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions include row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter the amount of columns to be read or written, the operation on a row is atomic. Columns may constitute clusters, called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. The row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partition and distribution are transparently operated and have space for client hash or fixed key.

HyperTable was developed similarly to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and the distributed lock manager. Data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), and allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency of concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.

– Document Database: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON; a database driver sends the query as a BSON object to MongoDB. The system allows queries over all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files on the main nodes, which record all the high-level operations conducted in the database. During copying, the slave nodes query the master for all the writing operations since their last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume grows. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency, but does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through the RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, the revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may execute along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the copying mechanism. CouchDB supports MVCC with historical Hash records.
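The read-modify-write cycle over CouchDB's RESTful HTTP API can be sketched as follows with the Python requests library; the server address, database name, and document ID are assumptions made for the example, and error handling is omitted.

```python
# Sketch of CouchDB's RESTful document update cycle using the requests
# library. Assumes a CouchDB server on localhost:5984 and a database
# named "demo"; the details are simplified for illustration.
import requests

base = "http://localhost:5984/demo"

# Create (or overwrite) a document identified by its unique ID.
requests.put(f"{base}/article-1", json={"title": "Big data survey", "year": 2014})

# To modify, download the whole document first; it carries a _rev field.
doc = requests.get(f"{base}/article-1").json()
doc["year"] = 2015

# Send the full document back; the matching _rev implements MVCC,
# and a stale revision is rejected as a conflict (HTTP 409).
resp = requests.put(f"{base}/article-1", json=doc)
print(resp.status_code, resp.json())
```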

Big data are generally stored in hundreds or even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].
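As an illustration of the two user-programmed functions, the following single-machine simulation counts words with a Map function that emits intermediate (word, 1) pairs and a Reduce function that compresses each key's value list; it only mimics the model, while a real framework such as Hadoop adds distribution, scheduling, and fault tolerance around the same interface.

```python
# Minimal single-machine simulation of the Map/Reduce contract, using
# word counting as the task.
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit intermediate key-value pairs: one (word, 1) per occurrence."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Compress all values sharing one key into a smaller result set."""
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:                       # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    results = {}
    for k, values in intermediate.items():          # shuffle + reduce phase
        for rk, rv in reduce_fn(k, values):
            results[rk] = rv
    return results

docs = [("d1", "big data needs big storage"), ("d2", "big data analysis")]
print(run_mapreduce(docs, map_fn, reduce_fn))
```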

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operational structure of Dryad is coordinated by a central program called the job manager, which can be executed in a cluster or a workstation through the network. A job manager consists of two parts: 1) application code used to build a job communication graph, and 2) program library code used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A with all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which enables the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
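The (Set A, Set B, Function F) abstraction can be expressed in a few lines; the comparison function below is a toy similarity measure chosen for illustration, whereas a real All-Pairs engine would distribute the same Cartesian-product computation across a cluster, following the four phases described above.

```python
# Illustrative sketch of the All-Pairs abstraction (Set A, Set B, Function F):
# every element of A is compared with every element of B, and the results
# form the output matrix M (the Cartesian product of A and B under F).

def all_pairs(set_a, set_b, f):
    return [[f(a, b) for b in set_b] for a in set_a]

# Toy comparison function: inner-product similarity of two feature vectors.
def similarity(a, b):
    return sum(x * y for x, y in zip(a, b))

A = [(1, 0), (0, 1)]
B = [(1, 1), (2, 0), (0, 3)]
M = all_pairs(A, B, similarity)
print(M)   # 2 x 3 output matrix
```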

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable and user-defined value, and every directed edge related to a source vertex is constituted by a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt; when all vertexes are in an inactive status and no messages remain to be transmitted, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
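The superstep model can be illustrated with a small single-machine simulation that propagates the maximum vertex value through a graph, an example commonly used to explain vertex-centric computation; the graph and values are invented, and a real Pregel system would execute the per-vertex function in parallel across workers with a synchronization barrier between supersteps.

```python
# Single-machine sketch of Pregel-style supersteps: each vertex repeatedly
# adopts the largest value it has seen and forwards it along its out-edges.

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
value = {"a": 3, "b": 6, "c": 2, "d": 1}

active = set(graph)                          # vertexes that have not voted to halt
inbox = {v: [] for v in graph}               # messages delivered this superstep
superstep = 0

while active:
    outbox = {v: [] for v in graph}
    for v in list(active):
        new_value = max([value[v]] + inbox[v])
        if superstep == 0 or new_value > value[v]:
            value[v] = new_value
            for neighbor in graph[v]:        # send messages along out-edges
                outbox[neighbor].append(new_value)
        else:
            active.discard(v)                # vote to halt
    inbox = outbox                           # barrier: visible next superstep only
    active |= {v for v, msgs in inbox.items() if msgs}   # messages reactivate vertexes
    superstep += 1

print(value)   # every vertex ends with the global maximum, 6
```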

Inspired by the above programming models, other research has also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional and big data, analytical architectures for big data, and software used for the mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values, providing suggestions, or supporting decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, and then using the few factors to reveal most of the information of the original data.

– Correlation Analysis: an analytical method for determining the law of relations among observed phenomena, such as correlation, correlative dependence, and mutual restriction, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, i.e., undetermined or inexact dependence relations, in which the numerical value of one variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis can turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing, a technology for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide both descriptions and inferences for big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research (a compact k-means sketch is given after this list).
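To make one of these algorithms concrete, the following is a minimal, self-contained sketch of k-means, the clustering algorithm named in the ICDM list above; the two-dimensional points and parameter choices are invented purely for illustration, and production systems would rely on a library implementation (e.g., scikit-learn) rather than this didactic version.

```python
# Compact sketch of k-means (Lloyd iterations) to illustrate cluster
# analysis: points are grouped so that objects in the same cluster are
# close to their centroid.
import random

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sum((a - b) ** 2
                      for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, clusters

pts = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)
```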

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are summarized as follows.

– Bloom Filter: a Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store the Hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has advantages such as high space efficiency and high query speed, but also has disadvantages regarding misrecognition (false positives) and deletion (a small sketch is given after this list).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound Hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of the Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons among character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
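As referenced in the first item of the list, the following is a minimal, illustrative Bloom filter: several hash functions set bits in a bit array, membership tests can return false positives but never false negatives, and elements cannot be removed. The array size and hashing scheme are arbitrary choices for the example, not a tuned implementation.

```python
# Minimal Bloom filter sketch: store hash values of data (as set bits)
# rather than the data itself, trading exactness for space efficiency.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("big data")
print(bf.might_contain("big data"))    # True
print(bf.might_contain("small data"))  # False with high probability
```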

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages have been developed on top of these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, as well as Scope and DryadLINQ for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.

Table 1 Comparison of MPI, MapReduce, and Dryad

– Deployment: MPI arranges computing nodes and data storage separately (data must be moved to the computing nodes), whereas MapReduce and Dryad arrange computing and data storage at the same node (computing is moved close to the data).
– Resource management/scheduling: MPI provides none; MapReduce uses Workqueue (Google) and HOD (Yahoo); for Dryad this is not clear.
– Low-level programming: the MPI API, the MapReduce API, and the Dryad API, respectively.
– High-level programming: MPI provides none; MapReduce has Pig, Hive, Jaql, etc.; Dryad has Scope and DryadLINQ.
– Data storage: MPI uses the local file system, NFS, etc.; MapReduce uses GFS (Google), HDFS (Hadoop), KFS, Amazon S3, etc.; Dryad uses NTFS and Cosmos DFS.
– Task partitioning: in MPI, the user manually partitions the tasks; in MapReduce and Dryad, partitioning is automatic.
– Communication: MPI uses messaging and remote memory access; MapReduce uses files (local FS, DFS); Dryad uses files, TCP pipes, and shared-memory FIFOs.
– Fault tolerance: MPI uses checkpoints; MapReduce and Dryad re-execute failed tasks.

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in E-commerce and finance. Since the data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in the memory, so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Currently, mainstream BI products come with data analysis plans supporting data over the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of 798 professionals on "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, or Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes consisting of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME; developers can effortlessly add various nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity to support location-aware, people-oriented, and context-aware operations.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, provide users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields

6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, the management and analysis of which rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing such models [127].
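To give a flavor of such link-structure models, the following is a compact, illustrative power-iteration sketch in the spirit of PageRank; the tiny link graph, the damping factor, and the iteration count are invented for the example and are not taken from the cited systems.

```python
# Sketch of PageRank-style link analysis: a page's importance is
# distributed to the pages it links to, iterated until the scores
# stabilize (power iteration with a damping factor).

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                      # dangling page: share evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(links))   # page "c" collects the highest score
```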

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data is gaining increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; useful knowledge is extracted and the semantics are understood through analysis. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories, so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and the retrieval result is optimized by relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests and recommend other contents with similar features to the users. These methods largely rely on content similarity measurement, but most of them are troubled by limited analysis capability and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graphic structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has long been devoted to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in an SNS [140]. Linear algebra computes the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph, in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and so does Twitter with trivial Tweets. Third, SNS are dynamic networks, which frequently and quickly vary and are updated. The existing research on social media analysis is still in its infancy. Considering that SNS contains massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just starting, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) gather together on networks, meet to make a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information for unlocking the security system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been rapidly developed. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3% bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project based on cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support for decision making in managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods involving mathematics, informatics, sociology, and management science, from three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis; a minimal detection sketch is given below.
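
The survey does not prescribe a particular community detection algorithm; the following minimal sketch, assuming the networkx library and a toy interest graph, shows how clustered structures with dense internal ties can be extracted with a standard modularity-based routine.

```python
# Minimal community-detection sketch on a toy interest graph (assumes the networkx library).
# Nodes are users; edges are social ties; communities are groups with dense internal links.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("ann", "bob"), ("bob", "carol"), ("ann", "carol"),      # a tightly knit triad
    ("dave", "eve"), ("eve", "frank"), ("dave", "frank"),    # a second triad
    ("carol", "dave"),                                        # one loose tie between the groups
])

communities = greedy_modularity_communities(G)
for i, members in enumerate(communities):
    print(f"community {i}: {sorted(members)}")
```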

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
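
Global Pulse's actual pipeline is not detailed here; aspect 1) can nevertheless be illustrated with a minimal sketch that flags days on which the volume of tweets about a topic grows or drops sharply relative to a rolling baseline. The daily counts and the z-score threshold below are hypothetical.

```python
# Minimal sketch of aspect 1): flag days where a topic's tweet volume deviates sharply
# from its recent baseline (hypothetical daily counts; a simple rolling z-score rule).
import numpy as np
import pandas as pd

counts = pd.Series(
    [120, 130, 125, 118, 140, 135, 128, 410, 132, 60],   # daily tweets mentioning "rice price"
    index=pd.date_range("2011-07-01", periods=10, freq="D"),
)

baseline = counts.rolling(window=5, min_periods=3).mean().shift(1)   # recent average volume
spread = counts.rolling(window=5, min_periods=3).std().shift(1)      # recent variability
z = (counts - baseline) / spread

anomalies = counts[z.abs() > 3]    # sharp growth or drop of the topic volume
print(anomalies)
```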

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. With this plan, doctors may reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20%.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. At present, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through a software development kit (SDK) and open interfaces.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. This can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need to intentionally deploy sensing modules and employ professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
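
As a minimal sketch of this operation framework, the snippet below matches a location-tagged sensing task to the nearest willing mobile users; the worker list, coordinates, and the simple nearest-first assignment rule are all assumptions for illustration.

```python
# Minimal spatial-crowdsourcing sketch: match a location-tagged sensing task to the
# nearest willing mobile users (hypothetical worker pool; great-circle distance).
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Hypothetical pool of volunteers: (id, lat, lon, willing)
workers = [
    ("w1", 30.52, 114.35, True),
    ("w2", 30.60, 114.30, True),
    ("w3", 30.51, 114.36, False),   # not willing, never assigned
    ("w4", 30.53, 114.34, True),
]

task = {"lat": 30.52, "lon": 114.36, "needed": 2}   # requester asks for 2 photos at this spot

candidates = [(haversine_km(task["lat"], task["lon"], lat, lon), wid)
              for wid, lat, lon, willing in workers if willing]
assigned = [wid for _, wid in sorted(candidates)[: task["needed"]]]
print("assigned workers:", assigned)   # the chosen users move to the spot and upload data
```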

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges for exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions with excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has deployed smart electric meters with great success, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced. Moreover, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal pricing sketch is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can then complement the traditional hydropower and thermal power generation.
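
The pricing scheme actually used by TXU Energy is not given in the survey; the minimal sketch below illustrates the time-sharing dynamic pricing mentioned above, under assumed data: 15-minute smart-meter readings are aggregated per hour, the busiest hours are labeled peak, and a higher illustrative tariff is applied to them.

```python
# Minimal time-of-use pricing sketch (hypothetical readings and tariffs, not TXU's actual scheme).
# 15-minute smart-meter readings are aggregated per hour; the busiest hours get a peak tariff.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2014-01-01", periods=4 * 24, freq="15min")        # one day of readings
kwh = pd.Series(rng.gamma(2.0, 0.3, len(idx)), index=idx)
kwh[(idx.hour >= 18) & (idx.hour < 22)] *= 3                            # evening demand surge

hourly = kwh.groupby(idx.hour).sum()
peak_hours = hourly.nlargest(4).index                                   # 4 busiest hours = peak

PEAK_PRICE, OFF_PEAK_PRICE = 0.25, 0.10                                 # $/kWh, illustrative
price = pd.Series(np.where(idx.hour.isin(peak_hours), PEAK_PRICE, OFF_PEAK_PRICE), index=idx)
bill = (kwh * price).sum()
print(f"peak hours: {sorted(peak_hours)}; daily bill: ${bill:.2f}")
```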

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. We then focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. Finally, we examined several representative applications of big data, including enterprise management, IoT, online social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after a system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of various alternative solutions before and after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and more value can be mined from the re-organized data; (iii) data exhaust: that is, wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is happening right now. We cannot predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 116 Yuki N (2011) Following digital breadcrumbs to big data gold

httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153 Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



and it provides a mapping between persistent, sequenced, and unchangeable keys and values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following server tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions include row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. Operations on a row are atomic, regardless of how many columns are read or written. Columns may be grouped into clusters called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partition and distribution are transparently operated and have space for client hash or fixed key.

HyperTable was developed to be similar to BigTable, in order to obtain a set of high-performance, scalable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), and allows users to create, modify, and query the underlying tables.

Since column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrency control over multiple versions, while HBase and HyperTable focus on strong consistency through locks or log records. A minimal sketch of this BigTable/Cassandra-style data model is given below.
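The following illustrative sketch (not an actual client API; all names are hypothetical) shows the sparse, multi-dimensional sorted map that the BigTable and Cassandra data models described above are built around: a cell is addressed by row key, column family, optional super column, and column, and keeps multiple timestamped versions.

```python
# Illustrative sketch of a BigTable/Cassandra-style multidimensional map.
from collections import defaultdict
import time


def nested_dict():
    return defaultdict(nested_dict)


store = nested_dict()   # row key -> column family -> [super column ->] column -> {timestamp: value}


def put(row, family, column, value, super_column=None):
    """Insert a cell; the timestamp key allows multiple versions per cell."""
    target = store[row][family]
    if super_column is not None:        # Cassandra-style super column level
        target = target[super_column]
    target[column][time.time()] = value


def get(row, family, column, super_column=None):
    """Return the newest version of a cell, or None if it is absent."""
    target = store[row][family]
    if super_column is not None:
        target = target[super_column]
    versions = target[column]
    if not versions:
        return None
    return versions[max(versions)]      # newest timestamp wins


put("user:42", "profile", "name", b"alice")
put("user:42", "posts", "title", b"hello", super_column="post:001")
print(get("user:42", "profile", "name"))
```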

– Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON: a database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents (a minimal query sketch is given after the CouchDB discussion below). Replication in MongoDB is executed with log files on the master node that record all the high-level operations conducted in the database. During replication, the slaves query all the write operations since their last synchronization with the master and execute the operations from the log files on their local databases. MongoDB supports horizontal scaling with automatic sharding, distributing data among thousands of nodes by automatically balancing load and handling failover.

188 Mobile Netw Appl (2014) 19171ndash209

– SimpleDB: SimpleDB is a distributed database offered as a web service by Amazon [94]. Data in SimpleDB is organized into domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot scale with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB only assures eventual consistency and does not support Multi-Version Concurrency Control (MVCC); therefore, conflicts cannot be detected on the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its revision identifier is updated. CouchDB achieves scalability through optimistic replication rather than a sharding mechanism. Since various CouchDB instances may execute transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
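As referenced in the MongoDB discussion above, the following minimal sketch shows the document-store style of interaction, using the pymongo driver. It assumes a MongoDB server running on localhost:27017; the database and collection names ("demo", "users") and the fields are hypothetical.

```python
# Minimal document-store sketch (assumes a local MongoDB and the pymongo driver).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# Documents are schema-free BSON objects; the _id field acts as the primary key.
db.users.insert_one({"name": "alice", "age": 34, "tags": ["bigdata", "iot"]})

# A secondary index on a queryable field speeds up lookups on that field.
db.users.create_index("age")

# Queries are expressed as JSON-like filter documents, including over arrays.
for doc in db.users.find({"age": {"$gt": 30}, "tags": "bigdata"}):
    print(doc["_id"], doc["name"])
```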

Big data is generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application (a minimal word-count sketch is given after this discussion). The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
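The following is a minimal, single-process sketch of the MapReduce contract described above, not Hadoop code: the user supplies only map() and reduce(), while the "framework" groups intermediate values by key (the shuffle) before calling reduce(). Word count is used as the example.

```python
# Single-process word-count sketch of the Map/Reduce programming contract.
from collections import defaultdict


def map_fn(_, line):                      # (input key, input value) -> intermediate pairs
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):              # (key, list of values) -> output pairs
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for key, value in records:            # map phase
        for k, v in map_fn(key, value):
            grouped[k].append(v)          # shuffle: group by intermediate key
    output = {}
    for k, values in grouped.items():     # reduce phase
        for out_k, out_v in reduce_fn(k, values):
            output[out_k] = out_v
    return output


lines = enumerate(["big data big value", "data analysis"])
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'big': 2, 'data': 2, 'value': 1, 'analysis': 1}
```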

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, resources in a logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or workstations through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set.

Mobile Netw Appl (2014) 19171ndash209 189

DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is utilized to compare all elements in Set A against all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B (a minimal sketch follows below).

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmission, so that every partition can retrieve its input data effectively. In Phase III, after the data is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the jobs complete in the batch processing system, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
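The following is an illustrative sketch of the All-Pairs abstraction (Set A, Set B, Function F): every element of A is compared with every element of B by F, yielding the output matrix M. The real system additionally models the cluster, partitions the work, and ships the input data first; the comparison function here is a made-up stand-in for, e.g., a biometric matcher.

```python
# Toy sketch of the All-Pairs abstraction: M[i][j] = F(A[i], B[j]).
def all_pairs(set_a, set_b, f):
    return [[f(a, b) for b in set_b] for a in set_a]


def overlap(a, b):
    """Hypothetical comparison function: number of shared characters."""
    return len(set(a) & set(b))


A = ["gattaca", "acgt"]
B = ["tacag", "ggcc"]
M = all_pairs(A, B, overlap)
print(M)   # the Cartesian product of comparisons between A and B
```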

– Pregel: the Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable, user-defined value, and every directed edge related to a source vertex is constituted by the user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by voting to halt. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed.

The Pregel program output is a set consisting of the values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs (a minimal superstep sketch is given below).
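The following is a minimal, single-machine sketch of the Pregel-style vertex-centric model just described: in each superstep every active vertex processes its incoming messages, may update its value and send messages along its out-edges, and otherwise votes to halt. The classic "propagate the maximum value" example is used; this is a simulation, not Google's implementation.

```python
# Single-machine sketch of Pregel-style supersteps: propagate the maximum value.
def pregel_max(graph, values):
    """graph: {vertex: [out-neighbours]}; values: {vertex: number}, mutated in place."""
    messages = {v: [] for v in graph}
    first_superstep = True
    while first_superstep or any(messages.values()):
        outbox = {v: [] for v in graph}
        for v in graph:
            incoming = messages[v]
            if not first_superstep and not incoming:
                continue                          # no messages: vertex stays halted
            new_value = max([values[v]] + incoming)
            if first_superstep or new_value > values[v]:
                values[v] = new_value
                for u in graph[v]:                # send along out-edges
                    outbox[u].append(new_value)
            # otherwise the vertex learned nothing new and votes to halt
        messages = outbox
        first_superstep = False
    return values


g = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pregel_max(g, {"a": 3, "b": 6, "c": 1}))    # every vertex converges to 6
```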

Inspired by the above programming models, other research efforts have also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and data-dependent flow control decision-making [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which changes frequently and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine the useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data; therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

– Cluster analysis is a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that does not require training data (a minimal clustering and regression sketch follows this list).

– Factor analysis is basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, and then using the few factors to reveal most of the information of the original data.



– Correlation analysis is an analytical method for determining the laws of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, also called a definitive dependence relationship; (ii) correlation, an undetermined or inexact dependence relation, in which the numerical value of one variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression analysis is a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B testing, also called bucket testing, is a technique for determining how to improve target variables by comparing a tested group against a control group. Big data requires a large number of such tests to be executed and analyzed.

– Statistical analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description of and an inference on big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
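As referenced in the cluster analysis item above, the following toy sketch illustrates two of these traditional methods, cluster analysis and regression analysis, using scikit-learn (assumed to be installed); the data is synthetic and the parameters are arbitrary.

```python
# Toy sketch of cluster analysis (k-means) and regression analysis (least squares).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Cluster analysis: group 2-D points into k clusters without any labels.
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster sizes:", np.bincount(kmeans.labels_))

# Regression analysis: recover an approximately linear dependence y ~ 3x + 2.
x = rng.uniform(0, 10, (100, 1))
y = 3 * x[:, 0] + 2 + rng.normal(0, 0.5, 100)
model = LinearRegression().fit(x, y)
print("slope ~", model.coef_[0], " intercept ~", model.intercept_)
```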

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom filter: a Bloom filter consists of a series of hash functions. The principle of a Bloom filter is to store hash values of the data, rather than the data itself, in a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has the disadvantages of false positives and lack of support for deletion (a minimal sketch follows this list).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: indexing is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically as data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons among character strings to the greatest extent, so as to improve query efficiency.

– Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
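As referenced in the Bloom filter item above, the following is a minimal sketch of that structure: k hash functions set bits in a bit array; lookups may return false positives but never false negatives, and elements cannot be removed. The sizes and the hashing scheme are arbitrary choices for illustration.

```python
# Minimal Bloom filter sketch: bitmap index built from k hash functions.
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
bf.add("taobao.com")
print("taobao.com" in bf, "example.org" in bf)   # True, (almost certainly) False
```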

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages have been developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, as well as Scope and DryadLINQ for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures need to be considered for different application requirements.

Mobile Netw Appl (2014) 19171ndash209 191

Table 1 Comparison of MPI, MapReduce, and Dryad

| Item | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault tolerance | Checkpoint | Task re-execution | Task re-execution |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in the memory, so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis has been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The currently mainstream BI products are provided with data analysis plans supporting levels over TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes Hadoop HDFS to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software tools, according to a survey of 798 professionals, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?", conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. Actually, R is an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (which ranked first in 2012). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and extensible framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly add various nodes and views to KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, along with their data and analysis characteristics, are discussed as follows.

– Evolution of commercial applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of network applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and to building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of scientific applications: scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, in fields such as astrophysics, oceanology, genomics, and environmental research. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to give a comprehensive coverage, we focus on the key problems and technologies of data analysis in the following discussions.


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, the management and analysis of which rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining (a minimal text-clustering sketch is given below).
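The following toy sketch illustrates one of the text-mining steps mentioned above, document clustering: raw text is turned into TF-IDF vectors and grouped with k-means. It assumes scikit-learn is available; the documents are made up for illustration.

```python
# Toy text-mining sketch: TF-IDF representation + k-means document clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stock markets rallied after the earnings report",
    "the central bank raised interest rates again",
    "the team won the championship game last night",
    "the striker scored twice in the final match",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)   # finance-related and sports-related documents should separate
```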

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages (a minimal PageRank sketch is given below). Topic-oriented crawlers are another successful case of utilizing these models [127].
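The following is a minimal power-iteration sketch of the PageRank idea cited above [125]: a page's score is the stationary probability of a random surfer who follows links with probability d and jumps to a random page otherwise. The link graph here is made up; real systems operate on web-scale graphs with distributed computation.

```python
# Minimal PageRank sketch via power iteration over a tiny hypothetical link graph.
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [outgoing links]}; returns {page: score}."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / len(pages) for p in pages}
        for p, outgoing in links.items():
            targets = outgoing or pages          # dangling page: spread evenly
            share = d * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))   # "c" collects the most link authority in this toy graph
```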

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browser history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data has increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and analysis is needed to extract useful knowledge and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users in conveniently and quickly looking up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is optimized by relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of the content they are interested in, and recommend other content with similar features to those users. These methods rely largely on content similarity measurement, but most of them suffer from limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using only a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in an SNS [140]. Linear algebra approaches compute the similarity between two vertexes from matrix decompositions [141] (a minimal neighborhood-based link-prediction sketch is given below). A community is represented by a sub-graph in which edges connecting vertexes within the sub-graph have high density, while edges between two sub-graphs have much lower density [142].
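The following toy sketch illustrates the simplest flavor of the link-prediction task discussed above: a non-existing edge (u, v) is scored by the Jaccard similarity of the two users' neighbour sets, with higher scores suggesting a more likely future link. The graph and user names are hypothetical; real feature-based or probabilistic predictors are considerably richer.

```python
# Toy neighbourhood-based link prediction: Jaccard similarity of friend sets.
def jaccard_link_scores(graph):
    """graph: {user: set(friends)} (undirected); returns {(u, v): score}."""
    users = sorted(graph)
    scores = {}
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if v in graph[u]:
                continue                      # already linked, nothing to predict
            union = graph[u] | graph[v]
            if union:
                scores[(u, v)] = len(graph[u] & graph[v]) / len(union)
    return scores


sns = {
    "ann": {"bob", "carl"},
    "bob": {"ann", "carl", "dave"},
    "carl": {"ann", "bob"},
    "dave": {"bob"},
}
for pair, s in sorted(jaccard_link_scores(sns).items(), key=lambda kv: -kv[1]):
    print(pair, round(s, 2))                  # ('ann', 'dave') is the top candidate
```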

Many methods for community detection have been proposed and studied, most of which are topology-based, relying on objective functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, as does Twitter with trivial tweets. Third, SNS are dynamic networks, which vary and update frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just starting, we only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, entertainment, etc.) gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal analysis mechanism of raw body sensor data for real-time health monitoring. Under the circumstance that only highly aggregated characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such multi-source healthcare data for predictive analysis.

Researchers from Gjøvik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes people's gait when they walk and uses the gait information to unlock the security system [11]. In the meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in accelerometer of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and a cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is much lower than that of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profoundly experienced the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.
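Route optimization of this kind is, at its core, a vehicle-routing computation. The sketch below shows only the simplest flavor of the idea, a nearest-neighbour ordering of delivery stops on toy coordinates; UPS's actual system is proprietary and far more elaborate than this.

# Nearest-neighbour routing sketch (illustration only, on invented coordinates).
import math

def nearest_neighbour_route(depot, stops):
    """Order delivery stops greedily by distance from the current position."""
    route, current, remaining = [], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        route.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return route

stops = [(2.0, 3.0), (5.0, 1.0), (1.0, 7.0), (6.0, 4.0)]   # toy delivery stops
print(nearest_neighbour_route((0.0, 0.0), stops))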

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills due to timely identifying and fixing water pipes that were running and leaking this year.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in the human society by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions, including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance to improve information propagation and for interpersonal relation analysis; a rough illustration of community detection follows below.
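As a minimal sketch of such community-based analysis, the code below detects communities in a small synthetic friendship graph with NetworkX's greedy modularity method; the graph and the names in it are invented for illustration only.

# Community detection sketch on a toy social graph (synthetic data).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("ann", "bob"), ("bob", "cara"), ("ann", "cara"),      # one dense cluster
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),        # another cluster
    ("cara", "dan"),                                       # loose external tie
])

# Each returned set is a "community": tight internal, loose external relations.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")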

The US Santa Cruz Police Department experimented by applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011, to analyze topics related to food, fuel, housing, and loan. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
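Aspects 1) and 4) above can be expressed very compactly: a rolling z-score flags abnormal growth or drops in a topic's daily volume, and a plain correlation relates tweet volume to an external indicator. The series below are synthetic stand-ins generated for illustration, not Global Pulse's data.

# Sketch of two Global-Pulse-style analyses on synthetic data:
# 1) flag abnormal growth/drop of a topic's daily tweet volume,
# 2) correlate tweet volume with an external indicator (e.g., a food price index).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2011-01-01", periods=120, freq="D")
tweets = pd.Series(100 + rng.normal(0, 10, 120), index=days)
tweets.iloc[80:85] += 60                                   # injected spike ("abnormal event")
price_index = 50 + 0.3 * tweets.rolling(7).mean().bfill() + rng.normal(0, 2, 120)

# 1) rolling z-score: values beyond 3 sigma are flagged as abnormal
roll = tweets.rolling(30)
z = (tweets - roll.mean()) / roll.std()
print("abnormal days:", list(z[z.abs() > 3].index.date))

# 4) simple correlation of tweet volume with the external indicator
print("correlation:", tweets.corr(price_index))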

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extreme personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to conduct coordination with mobile networks for the distribution of sensed tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows: a user may request services and resources related to a specified location; then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecast that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
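The core of that operation framework is a matching step between location-based tasks and nearby willing workers. The sketch below illustrates only that step on toy coordinates; the identifiers, the one-task-per-worker rule, and the 5 km radius are assumptions made for the example, not part of any particular platform.

# Spatial-crowdsourcing assignment sketch: match each location-based task
# to the closest available worker within a maximum travel radius (toy model).
import math

def assign_tasks(tasks, workers, max_km=5.0):
    """tasks/workers: dicts of id -> (x, y) positions in km on a local grid."""
    assignments, free = {}, dict(workers)
    for task_id, task_pos in tasks.items():
        if not free:
            break
        worker_id, pos = min(free.items(), key=lambda kv: math.dist(kv[1], task_pos))
        if math.dist(pos, task_pos) <= max_km:
            assignments[task_id] = worker_id
            del free[worker_id]            # each worker takes at most one task here
    return assignments

tasks = {"photo_bridge": (1.0, 2.0), "noise_sample": (6.0, 6.0)}
workers = {"w1": (0.5, 1.5), "w2": (5.0, 7.0), "w3": (9.0, 9.0)}
print(assign_tasks(tasks, workers))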

6.3.6 Smart grid

Smart Grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users; a minimal sketch of such time-of-use pricing is given after this list.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
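The time-of-use pricing mentioned in the second item above can be illustrated with a short sketch: derive the peak hours from 15-minute smart-meter readings and bill peak intervals at a higher rate. The load data, the two rates, and the choice of four peak hours below are all illustrative assumptions, not TXU Energy's actual scheme.

# Time-of-use pricing sketch: derive peak hours from 15-minute smart-meter
# readings and bill each interval at a peak or off-peak rate (toy data, toy rates).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2014-01-01", periods=4 * 24 * 30, freq="15min")  # one month
load = pd.Series(1.0 + rng.random(len(idx)), index=idx)               # kW readings
load[(idx.hour >= 18) & (idx.hour < 22)] += 2.0                       # evening peak

hourly = load.groupby(idx.hour).mean()
peak_hours = set(hourly.nlargest(4).index)                 # 4 busiest hours of the day
rate = np.where(idx.hour.isin(list(peak_hours)), 0.30, 0.12)   # assumed $/kWh tariffs
bill = float((load * rate).sum() * 0.25)                   # 15 minutes = 0.25 h
print(f"peak hours: {sorted(peak_hours)}, monthly bill: ${bill:.2f}")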

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and a big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of the display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there are many important problems that remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which refers to the wrong data collected during acquisition; in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown to be not applicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data in the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem; a minimal sketch of one standard safeguard, a k-anonymity check, is given after this list.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
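One standard technique for the privacy protection discussed in the first item of this list is k-anonymity: every combination of quasi-identifiers should be shared by at least k records before release. The following is a minimal check on invented records; it illustrates the general idea only and is not a method prescribed by this survey.

# k-anonymity check sketch: count how many records share each combination of
# quasi-identifiers; groups smaller than k are re-identification risks.
import pandas as pd

records = pd.DataFrame({
    "zip":       ["43001", "43001", "43001", "43002", "43002"],
    "age":       [29, 29, 29, 41, 52],
    "gender":    ["F", "F", "F", "M", "M"],
    "diagnosis": ["flu", "cold", "flu", "asthma", "flu"],   # sensitive attribute
})

def violates_k_anonymity(df, quasi_identifiers, k=2):
    sizes = df.groupby(quasi_identifiers).size()
    return sizes[sizes < k]        # groups that identify individuals too precisely

print(violates_k_anonymity(records, ["zip", "age", "gender"], k=2))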

The safety of big data has drawn great attention of researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We could not predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are friendly displayed may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic is more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.

Throughout the history of human society, the demands and willingness of human beings have always been the source powers to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it could not replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute of the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, Crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will be increasingly concerned and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7 Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special report on personal technology (2011)

116 Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval, ACM, pp 469–478

155 Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium, ACM, pp 445–454

156 Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


Page 19: Big Data: A Survey Min Chen

– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through the RESTful HTTP API. If a document needs to be modified, the client must download the entire document to modify it and then send it back to the database. After a document is rewritten once, the revision identifier will be updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
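The RESTful document workflow described above can be sketched with a few HTTP calls. The snippet below is a minimal illustration only: it assumes a local CouchDB instance at http://localhost:5984, the Python requests package, and made-up database and document names.

```python
import requests

BASE = "http://localhost:5984"        # assumed local CouchDB instance
DB = f"{BASE}/demo_db"                # hypothetical database name

requests.put(DB)                      # create the database (error handling omitted)

# Create a document; CouchDB stores it as a JSON object under a unique identifier.
requests.put(f"{DB}/reading-001", json={"type": "sensor_reading", "value": 23.5})

# To modify a document, download the entire document, change it, and send it back.
doc = requests.get(f"{DB}/reading-001").json()
doc["value"] = 24.1
# The _rev field carried inside `doc` is how CouchDB's MVCC detects conflicting updates.
requests.put(f"{DB}/reading-001", json=doc)
```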

Big data are generally stored in hundreds and even thou-sands of commercial servers Thus the traditional parallelmodels such as Message Passing Interface (MPI) and OpenMulti-Processing (OpenMP) may not be adequate to sup-port such large-scale parallel programs Recently someproposed parallel programming models effectively improvethe performance of NoSQL and reduce the performancegap to relational databases Therefore these models havebecome the cornerstone for the analysis of massive data

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then MapReduce will combine all the intermediate values related to the same key and transmit them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].
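A minimal, single-process sketch of the two user-programmed functions (here, word counting) is given below. It only illustrates the programming model; a real MapReduce framework distributes the map, shuffle, and reduce steps across a cluster, and the function names are illustrative.

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: each input (key, value) pair yields intermediate (key, value) pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: all intermediate values sharing a key are compressed into a smaller set.
    yield word, sum(counts)

def run_mapreduce(records):
    intermediate = defaultdict(list)
    for key, value in records:                     # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)              # shuffle: group values by key
    results = {}
    for k, vs in intermediate.items():             # reduce phase
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

print(run_mapreduce(enumerate(["big data survey", "big data analysis"])))
# {'big': 2, 'data': 2, 'survey': 1, 'analysis': 1}
```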

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in a relational database, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve the programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by acentral program called job manager which can be exe-cuted in clusters or workstations through network Ajob manager consists of two parts 1) application codeswhich are used to build a job communication graphand 2) program library codes that are used to arrangeavailable resources All kinds of data are directly trans-mitted between vertexes Therefore the job manager isonly responsible for decision-making which does notobstruct any data transmission

In Dryad application developers can flexibly chooseany directed acyclic graph to describe the communica-tion modes of the application and express data transmis-sion mechanisms In addition Dryad allows vertexesto use any amount of input and output data whileMapReduce supports only one input and output set


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.
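Dryad's actual API is not reproduced here; the sketch below only illustrates the underlying idea of a job expressed as a directed acyclic graph whose vertexes are programs and whose edges are data channels, executed in dependency order. All names are made up for illustration.

```python
# Illustrative DAG "job": vertexes are Python callables, edges are in-memory channels.
from graphlib import TopologicalSorter

def read(_):        return [1, 2, 3, 4]            # source vertex
def square(inputs): return [x * x for x in inputs[0]]
def total(inputs):  return sum(inputs[0])          # sink vertex

vertexes = {"read": read, "square": square, "total": total}
edges = {"square": {"read"}, "total": {"square"}}  # vertex -> upstream vertexes

channels = {}
for v in TopologicalSorter(edges).static_order():  # run vertexes in dependency order
    upstream = [channels[u] for u in edges.get(v, ())]
    channels[v] = vertexes[v](upstream)

print(channels["total"])   # 30
```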

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.
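The three-tuple abstraction (Set A, Set B, Function F) maps directly to a comparison matrix; a minimal single-machine sketch is shown below. The real system distributes this work over a cluster in the four phases described in the next paragraph, and the example comparison function is made up.

```python
# All-Pairs abstraction: M[i][j] = F(A[i], B[j]) for every pair of elements.
def all_pairs(set_a, set_b, f):
    return [[f(a, b) for b in set_b] for a in set_a]

def overlap(a, b):
    # Toy comparison function F: number of shared features between two items.
    return len(set(a) & set(b))

A = [("x", "y"), ("y", "z")]
B = [("x",), ("y", "z", "w")]
print(all_pairs(A, B, overlap))   # [[1, 1], [0, 2]]
```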

All-Pairs is implemented in four phases systemmodeling distribution of input data batch job man-agement and result collection In Phase I an approx-imation model of system performance will be built toevaluate how much CPU resource is needed and how toconduct job partition In Phase II a spanning tree is builtfor data transmissions which makes the workload ofevery partition retrieve input data effectively In PhaseIII after the data flow is delivered to proper nodes theAll-Pairs engine will build a batch-processing submis-sion for jobs in partitions while sequencing them in thebatch processing system and formulating a node run-ning command to acquire data In the last phase afterthe job completion of the batch processing system theextraction engine will collect results and combine themin a proper structure which is generally a single file listin which all results are put in order

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable and user-defined value, and every directed edge related to a source vertex is constituted by the user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm is completed and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and the status of its output edges, receive messages sent from the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. A vertex may deactivate itself by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed.

The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
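The vertex-centric, superstep-based style can be imitated on a single machine; the sketch below propagates the maximum vertex value through a small directed graph and halts when no vertex remains active. It illustrates the idea only, not Pregel's API.

```python
# Pregel-style sketch: each superstep, every active vertex runs the same function,
# may update its value, and sends messages along its outgoing edges.
def max_value_propagation(values, out_edges):
    messages = {v: [] for v in values}
    active = set(values)
    while active:
        next_messages = {v: [] for v in values}
        reached = set()
        for v in active:
            if messages[v] and max(messages[v]) > values[v]:
                values[v] = max(messages[v])            # apply incoming messages
            for u in out_edges.get(v, ()):              # send value to neighbours
                next_messages[u].append(values[v])
                reached.add(u)
        # A vertex stays active only if the new messages could still change it.
        active = {v for v in reached if max(next_messages[v]) > values[v]}
        messages = next_messages
    return values

print(max_value_propagation({"a": 3, "b": 6, "c": 2},
                            {"a": ["b"], "b": ["c"], "c": ["a"]}))
# {'a': 6, 'b': 6, 'c': 6}
```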

Inspired by the above programming models otherresearches have also focused on programming modes formore complex computational tasks eg iterative computa-tions [105 106] fault-tolerant memory computations [107]incremental computations [108] and flow control decision-making related to data [109]

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values, providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category will have high homogeneity while different categories will have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data (see the code sketch after this list).

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor, and the few factors are then used to reveal most of the information of the original data.



ndash Correlation Analysis is an analytical method for deter-mining the law of relations such as correlation cor-relative dependence and mutual restriction amongobserved phenomena and accordingly conducting fore-cast and control Such relations may be classified intotwo types (i) function reflecting the strict dependencerelationship among phenomena which is also calleda definitive dependence relationship (ii) correlationsome undetermined or inexact dependence relationsand the numerical value of a variable may correspondto several numerical values of the other variable andsuch numerical values present a regular fluctuationsurrounding their mean values

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may make complex and undetermined correlations among variables simple and regular.

– A/B Testing: also called bucket testing, a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics, in which randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
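Cluster analysis and regression analysis, mentioned above, are the easiest of these methods to illustrate in code. The following minimal sketch uses NumPy and scikit-learn (assumed to be installed); the data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Cluster analysis: group unlabeled points into two clusters (unsupervised).
points = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)                               # e.g. [0 0 1 1]: two homogeneous groups

# Regression analysis: recover the dependence of y on x hidden by random noise.
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + np.random.normal(scale=0.5, size=20)
model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)     # roughly 3.0 and 5.0
```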

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring values for enterprises and individuals. At present, the main processing methods of big data are shown as follows.

– Bloom Filter: a Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy-compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion (a small sketch is given after this list).

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost for storing index files, which should be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of a hash tree. It is mainly applied to rapid retrieval and word-frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons among character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem and assign the parts to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
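As announced under the Bloom Filter item, the sketch below is a minimal Python Bloom filter. The bit-array size, the number of hash functions, and the trick of salting Python's built-in hash are illustrative choices, not recommendations for production use.

```python
class BloomFilter:
    # Store only hash-derived bits, never the data itself (lossy-compression storage).
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # k hash functions simulated by salting one hash with different seeds.
        return [hash((seed, item)) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False positives (misrecognition) are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user-42")
print(bf.might_contain("user-42"), bf.might_contain("user-43"))   # True, (probably) False
```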

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

– Deployment: MPI arranges computing nodes and data storage separately (data must be moved to the computing nodes); MapReduce and Dryad arrange computing and data storage at the same node (computing is moved close to the data)
– Resource management/scheduling: MPI provides none; MapReduce uses Workqueue (Google) and HOD (Yahoo); for Dryad it is not clear
– Low-level programming: MPI API; MapReduce API; Dryad API
– High-level programming: none for MPI; Pig, Hive, Jaql, etc. for MapReduce; Scope and DryadLINQ for Dryad
– Data storage: MPI uses the local file system, NFS, etc.; MapReduce uses GFS (Google), HDFS (Hadoop), KFS, Amazon S3, etc.; Dryad uses NTFS and Cosmos DFS
– Task partitioning: in MPI the user manually partitions the tasks; MapReduce and Dryad partition automatically
– Communication: MPI uses messaging and remote memory access; MapReduce uses files (local FS, DFS); Dryad uses files, TCP pipes, and shared-memory FIFOs
– Fault tolerance: MPI uses checkpointing; MapReduce and Dryad re-execute failed tasks

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

ndash Real-time analysis is mainly used in E-commerce andfinance Since data constantly changes rapid data anal-ysis is needed and analytical results shall be returnedwith a very short delay The main existing architec-tures of real-time analysis include (i) parallel process-ing clusters using traditional relational databases and(ii) memory-based computing platforms For exampleGreenplum from EMC and HANA from SAP are bothreal-time analysis architectures

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission with hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, while even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory, so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The currently mainstream BI products are provided with data analysis plans to support the level over TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" of 798 professionals conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed with C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked top 1 in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year," R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software package used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL) data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that comprise various operators. The entire flow can be deemed as a production line of a factory, with original data input and model results output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to extend KNIME: developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However data analysis involves a wide range of applica-tions which frequently change and are extremely complexIn this section we first review the evolution of data sourcesWe then examine six of the most important data analysisfields including structured data analysis text analysis web-site analysis multimedia analysis network analysis andmobile analysis Finally we introduce several key applica-tion fields of big data

6.1 Application evolutions

Recently big data analysis has been proposed as anadvanced analytical technology which typically includeslarge-scale and complex programs under specific analyticalmethods As a matter of fact data driven applications haveemerged in the past decades For example as early as 1990sBI has become a prevailing technology for business appli-cations and network search engines based on massive datamining processing emerged in the early 21st century Somepotential and influential applications from different fieldsand their data and analysis characteristics are discussed asfollows

ndash Evolution of Commercial Applications The earliestbusiness data was generally structured data which wascollected by companies from legacy systems and thenstored in RDBMSs Analytical techniques used in suchsystems were prevailing in the 1990s and was intu-itive and simple eg in the forms of reports dash-board queries with condition search-based businessintelligence online transaction processing interactivevisualization score cards predictive modeling anddata mining [114] Since the beginning of 21st cen-tury networks and the World Wide Web (WWW) hasbeen providing a unique opportunity for organizationsto have online display and directly interact with cus-tomers Abundant products and customer informationsuch as clickstream data logs and user behavior can beacquired from the WWW Product layout optimizationcustomer trade analysis product suggestions and mar-ket structure analysis can be conducted by text analysisand website mining techniques As reported in [115]the quantity of mobile phones and tablet PC first sur-passed that of laptops and PCs in 2011 Mobile phonesand Internet of Things based on sensors are opening anew generation of innovation applications and requir-ing considerably larger capacity of supporting locationsensing people oriented and context-aware operation

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

ndash Evolution of Scientific Applications Scientific researchin many fields is acquiring massive data with high-throughput sensors and instruments such as astro-physics oceanology genomics and environmentalresearch The US National Science Foundation (NSF)has recently announced the BIGDATA program to pro-mote efforts to extract knowledge and insights fromlarge and complex collections of digital data Somescientific research disciplines have developed big dataplatforms and obtained useful outcomes For examplein biology iPlant [116] applies network infrastructurephysical computing resources coordination environ-ment virtual machine resources inter-operative anal-ysis software and data service to assist researcherseducators and students in enriching plant sciences TheiPlant datasets have high varieties in form includingspecification or reference data experimental data ana-log or model data observation data and other deriveddata

As discussed we can divide data analysis research intosix key technical fields ie structured data analysis textdata analysis web data analysis multimedia data analy-sis network data analysis and mobile data analysis Sucha classification aims to emphasize data characteristics butsome of the fields may utilize similar basic technologiesSince data analysis has a broad scope and it is not easy tohave a comprehensive coverage we will focus on the keyproblems and technologies in data analysis in the followingdiscussions

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generatemassive structured data of which the management and anal-ysis rely on mature commercialized technologies such asRDBMS data warehouse OLAP and BPM (Business Pro-cess Management) [28] Data analysis is mainly based ondata mining and statistical analysis both of which have beenwell studied over the past 30 years

However data analysis is still a very active research fieldand new application demands drive the development of newmethods For example statistical machine learning based onexact mathematical models and powerful algorithms havebeen applied to anomaly detection [117] and energy con-trol [118] Exploiting data characteristics time and spacemining can extract knowledge structures hidden in high-speed data flows and sensors [119] Driven by privacyprotection in e-commerce e-government and health careapplications privacy protection data mining is an emergingresearch field [120] Over the past decade process mining isbecoming a new research field especially in process analysiswith event data [121]

6.2.2 Text data analysis

The most common format of information storage is texteg emails business documents web pages and socialmedia Therefore text analysis is deemed to feature morebusiness-based potential than structured data Generallytext analysis is a process to extract useful informationand knowledge from unstructured text Text mining isinter-disciplinary involving information retrieval machinelearning statistics computing linguistics and data min-ing in particular Most text mining systems are based ontext expressions and natural language processing (NLP)with more emphasis on the latter NLP allows comput-ers to analyze interpret and even generate text Somecommon NLP methods include lexical acquisition wordsense disambiguation part-of-speech tagging and prob-abilistic context free grammar [122] Some NLP-basedtechniques have been applied to text mining includinginformation extraction topic models text summarizationclassification clustering question answering and opinionmining

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of the semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of the models to look up relevant website pages. Topic-oriented crawlers are another successful case of utilizing the models [127].
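PageRank scores, for example, can be obtained with a simple power iteration over the link graph. The sketch below uses a made-up three-page link structure and the conventional damping factor of 0.85; it illustrates the idea only, not the cited systems.

```python
def pagerank(links, damping=0.85, iterations=50):
    # Iteratively redistribute rank along hyperlinks until the scores stabilize.
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages            # dangling page: spread rank evenly
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # illustrative link structure
print(pagerank(links))   # page "c" receives the highest score
```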

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; useful knowledge is extracted and the semantics are understood by analyzing such data. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extractingthe prominent words or phrases from metadata or syn-thesizing a new representation Video summarization is tointerpret the most important or representative video con-tent sequence and it can be static or dynamic Static videosummarization methods utilize a key frame sequence orcontext-sensitive key frames to represent a video Suchmethods are simple and have been applied to many businessapplications (eg by Yahoo AltaVista and Google) buttheir performance is poor Dynamic summarization meth-ods use a series of video frame to represent a video andtake other smooth measures to make the final summariza-tion look more natural In [128] the authors propose atopic-oriented multimedia summarization system (TOMS)that can automatically summarize the important informationin a video belonging to a certain topic area based on a givenset of extracted features from the video

Multimedia annotation inserts labels to describe con-tents of images and videos at both syntax and semanticlevels With such labels the management summarizationand retrieval of multimedia data can be easily implementedSince manual annotation is both time and labor inten-sive automatic annotation without any human interventionsbecomes highly appealing The main challenge for auto-matic multimedia annotation is the semantic differenceAlthough much progress has been made the performanceof existing automatic annotation methods still needs to beimproved Currently many efforts are being made to syn-chronously explore both manual and automatic multimediaannotation [129]

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation are to utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos; the retrieval results are then optimized through relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of their interests, and recommend other contents with similar features to users. These methods largely rely on content similarity measurement, but most of them are troubled by limited analysis and excessive specification. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method is introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].

The US National Institute of Standards and Technol-ogy (NIST) initiated the TREC Video Retrieval Evaluationfor detecting the occurrence of an event in video-clipsbased on Event Kit which contains some text descriptionrelated to concepts and video examples [134] In [135]the author proposed a new algorithm on special multimediaevent detection using a few positive training examples Theresearch on video event detection is still in its infancy andmainly focuses on sports or news events running or abnor-mal events in monitoring videos and other similar eventswith repetitive patterns

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graphic structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification is to select a group of features for a vertex and utilize the existing link information to generate binary classifiers to predict the future link [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graphic matrix, in which edges connecting vertexes in the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
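Among the simplest features used in such link-prediction classifiers is the overlap between two users' neighbourhoods. The sketch below computes two classic structural scores for a candidate vertex pair on a made-up graph; the cited works [139–141] of course use far richer models.

```python
def neighbour_scores(adjacency, u, v):
    # Score a candidate pair by how much their neighbourhoods overlap.
    nu, nv = adjacency[u], adjacency[v]
    common = len(nu & nv)
    union = len(nu | nv)
    return {"common_neighbours": common,
            "jaccard": common / union if union else 0.0}

adjacency = {                      # illustrative undirected social graph
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave":  {"bob"},
}
print(neighbour_scores(adjacency, "alice", "dave"))
# {'common_neighbours': 1, 'jaccard': 0.5}
```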

Many methods for community detection have been pro-posed and studied most of which are topology-based targetfunctions relying on the concept of capturing communitystructure Du et al utilized the property of overlapping com-munities in real life to propose an effective large-scale SNScommunity detection method [143] The research on SNSaims to look for a law and deduction model to interpretnetwork evolution Some empirical studies found that prox-imity bias geographical limitations and other factors playimportant roles in SNS evolution [144ndash146] and some gen-eration methods are proposed to assist network and systemdesign [147]

Social influence refers to the case when individualschange their behavior under the influence of others Thestrength of social influence depends on the relation amongindividuals network distances time effect and characteris-tics of networks and individuals etc Marketing advertise-ment recommendation and other applications can benefitfrom social influence by qualitatively and quantitativelymeasuring the influence of individuals on others [148 149]Generally if the proliferation of contents in SNS is consid-ered the performance of link-based structural analysis maybe further improved

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and so do trivial Tweets in Twitter. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android apps numbered more than 650,000, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) gather together on networks, meet to make a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such variably aggregated multi-source data.

Researchers from Gjovik University College in Norwayand Derawi Biometrics collaborated to develop an applica-tion for smart phones which analyzes paces when peoplewalk and uses the pace information for unlocking the safetysystem [11] In the meanwhile Robert Delano and BrianParise from Georgia Institute of Technology developed anapplication called iTrem which monitors human body trem-bling with a built-in seismograph in a mobile phone so as tocope with Parkinson and other nervous system diseases [11]

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present big data mainly comes from and is mainly usedin enterprises while BI and OLAP can be regarded as thepredecessors of big data application The application of bigdata in enterprises can enhance their production efficiencyand competitiveness in many aspects In particular on mar-keting with correlation analysis of big data enterprises canmore accurately predict the consumer behavior and findnew business modes On sales planning after comparisonof massive data enterprises can optimize their commodityprices On operation enterprises can improve their opera-tion efficiency and satisfaction optimize the labor forceaccurately forecast personnel allocation requirements avoidexcess production capacity and reduce labor cost On sup-ply chain using big data enterprises may conduct inventoryoptimization logistic optimization and supplier coordina-tion etc to mitigate the gap between supply and demandcontrol budgets and improve services

In finance, the application of big data in enterprises has been rapidly developed. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao, and the corresponding transaction times, commodity prices, and purchase quantities are recorded every day, and, more importantly, along with the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data but also oneof the main markets of big data applications Because of thehigh variety of objects the applications of IoT also evolveendlessly

Logistics enterprises may have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the applicationof IoT data For example the smart city project coopera-tion between the Miami-Dade County in Florida and IBMclosely connects 35 types of key county government depart-ments and Miami city and helps government leaders obtainbetter information support in decision making for manag-ing water resources reducing traffic jam and improvingpublic safety The application of smart city brings aboutbenefits in many aspects for Dade County For instance theDepartment of Park Management of Dade County saved onemillion USD in water bills due to timely identifying andfixing water pipes that were running and leaking this year

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared space, etc., and represents various user activities. The analysis of big data from online SNS uses computational analytical methods provided for understanding relations in the human society by virtue of theories and methods, which involve mathematics, informatics, sociology, and management science, etc., from three dimensions, including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the enabling technologies for the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance to improve information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experi-mented by applying data for predictive analysis Byanalyzing SNS the police department can discovercrime trends and crime modes and even predict thecrime rates in major regions [11]

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011, to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
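Aspect 1) above, detecting a sharp growth or drop in the volume of a topic, can be approximated with a simple moving-window rule. The sketch below flags days whose count deviates strongly from the recent mean; the daily counts and the threshold are made up for illustration.

```python
from statistics import mean, pstdev

def flag_spikes(daily_counts, window=7, threshold=3.0):
    # Flag day t when its count deviates from the mean of the previous `window`
    # days by more than `threshold` standard deviations.
    alerts = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mu, sigma = mean(history), pstdev(history) or 1.0
        if abs(daily_counts[day] - mu) > threshold * sigma:
            alerts.append(day)
    return alerts

counts = [120, 130, 118, 125, 122, 127, 119, 121, 450, 128]   # tweets per day (toy data)
print(flag_spikes(counts))   # [8]: the sudden jump to 450 tweets is flagged
```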

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications via the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operational framework of spatial crowdsourcing is as follows: a user may request services and resources related to a specified location; then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
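The operational framework above can be sketched as a simple assignment step in which a location-tagged task is matched to the nearest willing mobile user; the data model, names, and nearest-worker policy below are illustrative assumptions rather than a published spatial crowdsourcing algorithm.

# Minimal spatial-crowdsourcing sketch: assign a location-tagged task to the nearest willing worker.
# Identifiers, coordinates, and the nearest-worker policy are hypothetical.
import math
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    lat: float
    lon: float
    payload: str          # e.g., "take a photo of the intersection"

@dataclass
class Worker:
    worker_id: str
    lat: float
    lon: float
    willing: bool         # whether the user volunteered for this task type

def distance_km(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def assign(task, workers):
    candidates = [w for w in workers if w.willing]
    if not candidates:
        return None
    return min(candidates, key=lambda w: distance_km(task.lat, task.lon, w.lat, w.lon))

task = Task("t1", 30.52, 114.31, "record a short traffic video")
workers = [Worker("u1", 30.50, 114.30, True), Worker("u2", 30.60, 114.40, True), Worker("u3", 30.52, 114.31, False)]
chosen = assign(task, workers)
print("assigned to:", chosen.worker_id if chosen else "no willing worker nearby")

In a real deployment the task request, distribution, and data collection would run over a mobile network platform; the sketch only captures the matching step.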

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control, for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has widely deployed smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy used such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-of-use dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal pricing sketch is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
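As a toy illustration of the time-of-use idea mentioned in the second item above, the following sketch bills 15-minute smart-meter readings at different peak and off-peak rates; the hours and tariffs are invented for the example and are not TXU Energy's actual prices.

# Toy time-of-use billing sketch over 15-minute smart-meter readings.
# Peak hours and tariffs are assumptions for illustration, not a real utility's pricing.
PEAK_HOURS = range(14, 20)     # 2 pm - 8 pm counted as peak (assumed)
PEAK_PRICE = 0.22              # $/kWh during peak (hypothetical)
OFFPEAK_PRICE = 0.09           # $/kWh otherwise (hypothetical)

def daily_bill(readings_kwh):
    """readings_kwh: 96 values, one per 15-minute interval of one day."""
    assert len(readings_kwh) == 96
    bill = 0.0
    for interval, kwh in enumerate(readings_kwh):
        hour = interval // 4                                   # 4 intervals per hour
        price = PEAK_PRICE if hour in PEAK_HOURS else OFFPEAK_PRICE
        bill += kwh * price
    return bill

# Hypothetical household: a flat 0.2 kWh per interval, with a larger draw during peak hours.
readings = [0.2] * 96
for interval in range(14 * 4, 20 * 4):
    readings[interval] = 0.5
print(f"daily bill: ${daily_bill(readings):.2f}")

Charging peak consumption at a higher rate gives users an incentive to shift load, which is the stabilizing effect described above.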

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of various alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck of big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems of big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and more value can be mined from the re-organized data; (iii) data exhaust, i.e., erroneous data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, and this is just happening. We cannot predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic is more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will be of increasing concern and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 116 Yuki N (2011) Following digital breadcrumbs to big data gold

httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11 Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media, Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc

93 Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc

95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc

96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425

Mobile Netw Appl (2014) 19171ndash209 207

100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103 Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008 IEEE international symposium on IPDPS 2008, IEEE, pp 1–11

104 Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 135–146

105 Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106 Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing, ACM, pp 810–818

107 Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, USENIX Association, pp 2–2

108 Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, ACM, p 7

109 Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B (a minimal sketch is given after this list).

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in partitions, while sequencing them in the batch-processing system and formulating node running commands to acquire data. In the last phase, after job completion in the batch-processing system, the extraction engine collects the results and combines them into a proper structure, which is generally a single file list in which all results are put in order.

ndash Pregel The Pregel [104] system of Google facilitatesthe processing of large-sized graphs eg analysis ofnetwork graphs and social networking services A com-putational task is expressed by a directed graph con-stituted by vertexes and directed edges Every vertexis related to a modifiable and user-defined value andevery directed edge related to a source vertex is con-stituted by the user-defined value and the identifier ofa target vertex When the graph is built the programconducts iterative calculations which is called super-steps among which global synchronization points areset until algorithm completion and output completionIn every superstep vertex computations are paralleland every vertex executes the same user-defined func-tion to express a given algorithm logic Every vertexmay modify its and its output edges status receive amessage sent from the previous superstep send themessage to other vertexes and even modify the topolog-ical structure of the entire graph Edges are not providedwith corresponding computations Functions of everyvertex may be removed by suspension When all ver-texes are in an inactive status without any message totransmit the entire program execution is completed

The output of a Pregel program is the set consisting of the values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
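To make the vertex-centric superstep model concrete, here is a rough Python sketch (not Google's actual Pregel API) of single-source shortest paths, one of the canonical examples used for this model; the toy graph, edge weights, and the dictionary-based message passing are assumptions made only for illustration.

```python
import math

def sssp_supersteps(graph, source):
    """Pregel-style single-source shortest paths.
    graph: dict mapping each vertex to a list of (neighbour, edge_weight) pairs."""
    dist = {v: math.inf for v in graph}
    messages = {v: [] for v in graph}
    messages[source] = [0.0]                       # superstep 0: only the source is active
    while any(messages.values()):                  # run supersteps until no messages remain
        new_messages = {v: [] for v in graph}
        for v, out_edges in graph.items():
            if messages[v]:                        # vertex is woken up by incoming messages
                candidate = min(messages[v])
                if candidate < dist[v]:            # value improved: update and notify neighbours
                    dist[v] = candidate
                    for target, weight in out_edges:
                        new_messages[target].append(candidate + weight)
        messages = new_messages                    # vertexes without messages stay halted
    return dist

if __name__ == "__main__":
    toy_graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
    print(sssp_supersteps(toy_graph, "a"))         # {'a': 0.0, 'b': 1.0, 'c': 3.0}
```

Each pass of the while loop plays the role of one superstep separated by a global synchronization point, and the program halts when no vertex has pending messages, mirroring the inactive-vertex termination rule described above.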

Inspired by the above programming models, other studies have also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control and decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and for big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to certain features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data (a small k-means sketch follows this list).

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into one factor, and then using the few factors to reveal most of the information of the original data.


– Correlation Analysis: an analytical method for determining the law of relations among observed phenomena, such as correlation, correlative dependence, and mutual restriction, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, also called a definitive dependence relationship; (ii) correlation, some undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of another variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into something simple and regular.

– A/B Testing: also called bucket testing, a technology for determining how to improve target variables by comparing the tested groups. Big data requires a large number of tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics, in which randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data: descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
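As a concrete companion to the cluster analysis entry above (and to the k-means algorithm listed among the ICDM top ten), the following is a minimal k-means sketch in Python; the two-dimensional toy points, the choice of k, and the fixed iteration count are assumptions made purely for illustration, not part of the survey.

```python
import random

def kmeans(points, k, iterations=100):
    """Plain k-means: points is a list of (x, y) tuples, k the number of clusters."""
    centroids = random.sample(points, k)                  # initial centroids picked from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assignment step: nearest centroid
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for i, cluster in enumerate(clusters):             # update step: mean of each cluster
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

if __name__ == "__main__":
    data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
    centers, groups = kmeans(data, k=2)
    print(centers)
```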

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are shown as follows.

– Bloom Filter: a Bloom Filter consists of a series of Hash functions. Its principle is to store the Hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages regarding misrecognition and deletion (a small sketch follows at the end of this list).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound Hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of the Hash Tree, it is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into sub-problems and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
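The following is a minimal Python sketch of the Bloom Filter entry that opens this list; the bit-array size, the way several hash positions are derived from Python's built-in hash, and the toy queries are simplifying assumptions, not a production design.

```python
class BloomFilter:
    """Toy Bloom filter: stores hash positions of items in a bit array instead of the items."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive several hash positions from one item by salting the built-in hash.
        return [hash((seed, item)) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not present"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

if __name__ == "__main__":
    bf = BloomFilter()
    for word in ["big", "data", "survey"]:
        bf.add(word)
    print(bf.might_contain("data"), bf.might_contain("hadoop"))
```

A membership query that returns False is definitive, while True may be a false positive; this is the misrecognition disadvantage noted above, and items cannot be safely deleted once added.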

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
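As a rough sketch of the MapReduce model referred to above (not the Hadoop or Google API), the following Python fragment expresses the classic word-count job as a map phase, a shuffle that groups intermediate keys, and a reduce phase; the in-memory grouping here stands in for the distributed shuffle of a real system.

```python
from collections import defaultdict

def map_phase(document):
    # Emit one (word, 1) pair per word, as a MapReduce mapper would.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate values by key; a real system does this across the cluster.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    return (key, sum(values))

if __name__ == "__main__":
    documents = ["big data is big", "data analysis of big data"]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)   # {'big': 3, 'data': 3, 'is': 1, 'analysis': 1, 'of': 1}
```

High-level languages such as Pig or Hive essentially generate jobs of this map-shuffle-reduce shape from a more declarative description.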

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment – MPI: computing node and data storage arranged separately (data should be moved to the computing node); MapReduce: computing and data storage arranged at the same node (computing should be close to data); Dryad: computing and data storage arranged at the same node (computing should be close to data)

Resource management/scheduling – MPI: –; MapReduce: Workqueue (Google), HOD (Yahoo); Dryad: not clear

Low-level programming – MPI: MPI API; MapReduce: MapReduce API; Dryad: Dryad API

High-level programming – MPI: –; MapReduce: Pig, Hive, Jaql, etc.; Dryad: Scope, DryadLINQ

Data storage – MPI: the local file system, NFS, etc.; MapReduce: GFS (Google), HDFS (Hadoop), Amazon S3, etc.; Dryad: NTFS, KFS, Cosmos DFS

Task partitioning – MPI: the user manually partitions the tasks; MapReduce: automatic; Dryad: automatic

Communication – MPI: messaging, remote memory access; MapReduce: files (local FS, DFS); Dryad: files, TCP pipes, shared-memory FIFOs

Fault tolerance – MPI: checkpoint; MapReduce: task re-execution; Dryad: task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool TimeTunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Currently, mainstream BI products are provided with data analysis plans that support a level over TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to the different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey, "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?", of 798 professionals conducted by KDNuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDNuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (it ranked first). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME; developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity to support location-sensing, people-oriented, and context-aware operations.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their fields and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have comprehensive coverage, we will focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127].
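As a rough illustration of the link-based ranking idea behind PageRank (a simplified power iteration, not the production algorithm), the following Python sketch ranks the pages of a tiny hypothetical link graph; the damping factor, iteration count, and toy graph are conventional choices made only for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share        # pass rank along each hyperlink
            else:
                for target in pages:                  # dangling page: spread its rank evenly
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_web = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"]}
    print(sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]))
```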

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; multimedia analysis extracts useful knowledge and understands the semantics contained in such data. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization aims to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis; feature extraction; data mining, classification, and annotation; and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos; the retrieval result is then refined through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods rely largely on content similarity measurement, but most of them are troubled by limited analysis and excessive specification. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods that integrate the advantages of the aforementioned two types of methods have been introduced to improve recommendation quality [133].
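A minimal sketch of the collaborative-filtering idea mentioned above might look like the following Python fragment, which recommends items from user-user cosine similarity over a small hypothetical rating matrix; the users, items, and ratings are invented purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating dictionaries (item -> rating)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(target, ratings, top_n=2):
    """Score items the target user has not rated by similarity-weighted ratings of others."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

if __name__ == "__main__":
    ratings = {
        "alice": {"video1": 5, "video2": 3},
        "bob":   {"video1": 4, "video3": 5},
        "carol": {"video2": 4, "video3": 4, "video4": 5},
    }
    print(recommend("alice", ratings))   # ['video3', 'video4']
```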

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graphic structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in an SNS [140]. Linear algebra computes the similarity between two vertexes according to a singular similar matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
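As a rough sketch of feature-based link prediction as discussed above, the following Python fragment scores unconnected vertex pairs by their number of common neighbours, one simple structural feature often used for this task; the toy friendship graph is an assumption for the example only.

```python
from itertools import combinations

def common_neighbour_scores(graph):
    """graph: dict mapping each vertex to the set of its neighbours.
    Returns unconnected pairs scored by the size of their common neighbourhood."""
    scores = {}
    for u, v in combinations(sorted(graph), 2):
        if v not in graph[u]:                        # only score pairs that are not yet linked
            scores[(u, v)] = len(graph[u] & graph[v])
    return sorted(scores.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    friends = {
        "ann":  {"bob", "carl"},
        "bob":  {"ann", "carl", "dave"},
        "carl": {"ann", "bob", "dave"},
        "dave": {"bob", "carl"},
    }
    # ("ann", "dave") shares two neighbours, so it is the most likely future link.
    print(common_neighbour_scores(friends))
```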

Many methods for community detection have been proposed and studied, most of which are topology-based, with target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media includes text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android apps numbered more than 650,000, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is still in its early stages, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting in front of computers. By contrast, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, entertainment, etc.) gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been developing rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can become aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and to optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year, thanks to the timely identification and fixing of water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Figure 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
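A minimal sketch of the first analysis step above, flagging abnormal events from a sharp growth or drop in topic volume, could look like the following Python fragment, which marks days whose tweet count deviates strongly from a trailing mean; the window size, the threshold, and the daily counts themselves are assumptions chosen only for illustration.

```python
def flag_spikes(daily_counts, window=7, threshold=3.0):
    """Flag indices whose count deviates from the trailing mean by more than
    `threshold` standard deviations (a crude abnormal-event detector)."""
    flagged = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mean = sum(history) / window
        var = sum((x - mean) ** 2 for x in history) / window
        std = var ** 0.5 or 1.0                    # avoid division by zero on a flat history
        if abs(daily_counts[i] - mean) / std > threshold:
            flagged.append(i)
    return flagged

if __name__ == "__main__":
    counts = [120, 130, 125, 128, 122, 131, 127, 126, 129, 410, 133, 128]
    print(flag_spikes(counts))   # day 9 (count 410) is flagged as a sharp growth
```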

Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing, complex data containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing becomes a hot topic. The operation framework of spatial crowdsourcing is as follows: a user requests a service and resources related to a specified location; then, mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will be more prevailing than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
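A minimal sketch of the assignment step in the framework above, matching each spatial task to the nearest willing worker, might look like the following Python fragment; the coordinates, the workers, and the greedy nearest-worker policy are illustrative assumptions, not a description of any deployed spatial crowdsourcing system.

```python
import math

def assign_tasks(tasks, workers):
    """Greedy spatial assignment: each task (id -> (x, y)) gets the nearest free
    worker (id -> (x, y)); returns a dict task_id -> worker_id."""
    free = dict(workers)
    assignment = {}
    for task_id, (tx, ty) in tasks.items():
        if not free:
            break
        nearest = min(free, key=lambda w: math.hypot(free[w][0] - tx, free[w][1] - ty))
        assignment[task_id] = nearest
        del free[nearest]            # each worker handles at most one task in this sketch
    return assignment

if __name__ == "__main__":
    tasks = {"photo_bridge": (2.0, 3.0), "noise_market": (8.0, 1.0)}
    workers = {"w1": (1.0, 2.5), "w2": (7.5, 0.5), "w3": (5.0, 5.0)}
    print(assign_tasks(tasks, workers))   # {'photo_bridge': 'w1', 'noise_market': 'w2'}
```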

6.3.6 Smart grid

Smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, measured by phasor measurement units (PMUs) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the smart grid, regions with excessively high electrical loads or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit and demonstrates the power consumption of every block at a given moment. It can even compare the power consumption of a block with the average income per capita and the building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month, as in the past. Labor costs for meter reading are greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to the peak and low periods of power consumption. TXU Energy has utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a rough illustration of such time-of-use pricing is sketched after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
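The following is a rough Python illustration of the time-sharing (time-of-use) pricing idea referenced in the second item of this list; the 15-minute readings, the peak window, and the two price levels are invented numbers used only to show the computation, not actual tariffs of any utility.

```python
def time_of_use_bill(readings, peak_hours=range(17, 21), peak_price=0.30, offpeak_price=0.12):
    """readings: list of (hour, kwh) pairs, e.g. one entry per 15-minute interval.
    Returns (peak_kwh, offpeak_kwh, total_cost) under a two-level tariff."""
    peak_kwh = sum(kwh for hour, kwh in readings if hour in peak_hours)
    offpeak_kwh = sum(kwh for hour, kwh in readings if hour not in peak_hours)
    total = peak_kwh * peak_price + offpeak_kwh * offpeak_price
    return peak_kwh, offpeak_kwh, round(total, 2)

if __name__ == "__main__":
    # Four 15-minute smart-meter readings: two off-peak (3 am), two in the evening peak.
    meter_readings = [(3, 0.2), (3, 0.25), (18, 0.6), (18, 0.55)]
    print(time_of_use_bill(meter_readings))   # (1.15, 0.45, 0.4)
```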

7 Conclusion, open issues, and outlook

In this paper, we reviewed the background and state-of-the-art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. We then focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

711 Theoretical research

Although big data is a hot research area with great poten-tial in both academia and industry there are many importantproblems remain to be solved which are discussed below

ndash Fundamental problems of big data There is a com-pelling need for a rigorous and holistic definition of bigdata a structural model of big data a formal descriptionof big data and a theoretical system of data science Atpresent many discussions of big data look more likecommercial speculation than scientific research This isbecause big data is not formally and structurally definedand the existing models are not strictly verified

– Standardization of big data. An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after a system is implemented and deployed, which does not allow a horizontal comparison of the advantages and disadvantages of various alternative solutions before implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes. This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data. Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data. As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; (iii) data exhaust, which refers to wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management. The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data. As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of

the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy. Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, acquired data from the public pages of Facebook users who had failed to modify their privacy settings, via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data


quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism. Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security. Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have a social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening.

We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will be

of increasing concern and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2 Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3 Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4 Drowning in numbers – digital data will flood the planet – and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7 Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC Lehnardt J Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda H McLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112 What analytics, data mining, big data software you used in the past 12 months for a real project (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T Meinl T Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115 Beyond the PC. Special report on personal technology (2011)

116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iPlant collaborative cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences


Page 21: Big Data: A Survey Min Chen

– Correlation Analysis is an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, some undetermined or inexact dependence relations, in which the numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis is a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing, also called bucket testing, is a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis. Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms. Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
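
As a small illustration of one of the ten ICDM algorithms named above, the following minimal sketch implements the basic k-means iteration with NumPy. The synthetic two-dimensional data and the choice of k = 2 are assumptions made only for this example; they are not taken from the survey.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        # (Empty clusters are not handled in this toy sketch.)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two synthetic clusters around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(data, k=2)
print(centers.round(2))   # roughly one centroid near (0, 0) and one near (5, 5)
```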

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are shown as follows.

– Bloom Filter: a Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store Hash values of data rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition and deletion (see the sketch after this list).

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound Hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically as data is updated.

– Triel: also called a trie tree, a variant of the Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of Triel is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem and assign the parts to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
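
The following minimal sketch, referenced in the Bloom Filter item above, shows the bit-array-plus-hash-functions idea behind that structure. The bit-array size, the number of hash functions, and the use of salted SHA-1 digests are illustrative assumptions for this sketch, not a design recommended by the survey.

```python
import hashlib

class BloomFilter:
    """Lossy set membership: no false negatives, but possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several hash positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # False with high probability
```

Because only hash positions are stored, membership queries are fast and space-efficient, but an element can never be deleted safely and a query may occasionally report a false positive, which matches the misrecognition and deletion drawbacks noted above.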

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
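
To make the programming model concrete, the sketch below expresses the classic word count in MapReduce style on a single machine. The in-memory "shuffle" is only a stand-in for the distributed sort-and-merge that a real framework such as Hadoop performs; high-level layers such as Pig or Hive would let the same job be written as one or two declarative statements.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit (word, 1) for every word in the document.
    for word in text.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Sum the partial counts for one key.
    yield word, sum(counts)

def run_job(documents):
    # Shuffle: group intermediate values by key (done by the framework in Hadoop).
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reduce_phase(key, values):
            results[out_key] = out_value
    return results

docs = {1: "big data needs big storage", 2: "big data needs analysis"}
print(run_job(docs))   # {'big': 3, 'data': 2, 'needs': 2, 'storage': 1, 'analysis': 1}
```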

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment:
  MPI – computing node and data storage arranged separately (data should be moved to the computing node)
  MapReduce – computing and data storage arranged at the same node (computing should be close to data)
  Dryad – computing and data storage arranged at the same node (computing should be close to data)

Resource management / scheduling:
  MPI – –
  MapReduce – Workqueue (Google), HOD (Yahoo)
  Dryad – not clear

Low-level programming:
  MPI – MPI API
  MapReduce – MapReduce API
  Dryad – Dryad API

High-level programming:
  MPI – –
  MapReduce – Pig, Hive, Jaql, ...
  Dryad – Scope, DryadLINQ

Data storage:
  MPI – the local file system, NFS, ...
  MapReduce – GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ...
  Dryad – NTFS, Cosmos DFS

Task partitioning:
  MPI – user manually partitions the tasks
  MapReduce – automation
  Dryad – automation

Communication:
  MPI – messaging, remote memory access
  MapReduce – files (local FS, DFS)
  Dryad – files, TCP pipes, shared-memory FIFOs

Fault tolerance:
  MPI – checkpoint
  MapReduce – task re-execution
  Dryad – task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in E-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.
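
The contrast between the two modes can be sketched with a toy metric: an offline (batch) job scans a complete log file after it has been collected, while a real-time (streaming) consumer keeps a running aggregate and can answer at any moment. The log format and the "average latency" metric below are assumptions made only for this illustration.

```python
def batch_average_latency(log_lines):
    """Offline style: scan the whole log once it has been collected."""
    values = [float(line.split(",")[1]) for line in log_lines]
    return sum(values) / len(values)

class StreamingAverage:
    """Real-time style: update an aggregate as each record arrives."""
    def __init__(self):
        self.count, self.total = 0, 0.0

    def observe(self, latency_ms):
        self.count += 1
        self.total += latency_ms

    def current(self):
        return self.total / self.count if self.count else 0.0

log = ["req1,120.0", "req2,80.0", "req3,100.0"]      # "request_id,latency_ms"
print(batch_average_latency(log))                     # 100.0 (after the fact)

stream = StreamingAverage()
for line in log:
    stream.observe(float(line.split(",")[1]))
    print(stream.current())                           # 120.0, 100.0, 100.0 (at any moment)
```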

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in the memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture (see the sketch after this list). With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Currently, mainstream BI products are provided with data analysis plans supporting the level over TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.
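
As a small illustration of the memory-level tier referenced above, the sketch below stores and queries hot data in MongoDB through the pymongo driver. The connection string, database, collection, and document fields are all hypothetical, and a running MongoDB instance is assumed.

```python
from pymongo import MongoClient

# Assumed local MongoDB instance; in production this would be a replica set or cluster.
client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["click_events"]          # hypothetical database and collection

# Keep recent ("hot") events available for low-latency queries.
events.insert_many([
    {"user": "u1", "page": "/home", "ms": 85},
    {"user": "u2", "page": "/cart", "ms": 240},
    {"user": "u1", "page": "/cart", "ms": 130},
])

# Per-page average latency via the aggregation pipeline.
pipeline = [{"$group": {"_id": "$page", "avg_ms": {"$avg": "$ms"}}}]
for row in events.aggregate(pipeline):
    print(row["_id"], round(row["avg_ms"], 1))
```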


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of "What Analytics, Data mining, Big Data software you used in the past 12 months for a real project?" of 798 professionals made by KDnuggets in 2012 [112].

– R (30.7 %). R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed with C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked top 1 in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers, such as Teradata and Oracle, have released products supporting R.

– Excel (29.8 %). Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %). RapidMiner is an open source software package used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment.

The data mining flow is described in XML and displayed through a graphic user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes consisting of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %). KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to expand KNIME. Developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %). Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of Commercial Applications. The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have online displays and directly interact with customers. Abundant products and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity for supporting location sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications. The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plentiful advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, provide users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications. Scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, such as astrophysics, oceanology, genomics, and environmental research. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].
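As a simple illustration of the statistical flavor of such anomaly detection methods, the following sketch flags readings in a numeric stream that deviate strongly from the mean; it is only a minimal example, and the sample readings and threshold are hypothetical values chosen for the illustration.

```python
import numpy as np

def zscore_anomalies(values, threshold=3.0):
    """Return indices of readings that deviate from the mean by more
    than `threshold` standard deviations (a basic statistical detector)."""
    x = np.asarray(values, dtype=float)
    mean, std = x.mean(), x.std()
    if std == 0:
        return []
    z = np.abs(x - mean) / std
    return np.where(z > threshold)[0].tolist()

# Example: a sensor stream with one obvious outlier
readings = [20.1, 19.8, 20.3, 20.0, 35.7, 19.9, 20.2]
print(zscore_anomalies(readings, threshold=2.0))  # -> [4]
```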

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
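To make the topic-model idea mentioned above concrete, the sketch below fits a small Latent Dirichlet Allocation model with scikit-learn; the toy corpus and the parameter values are illustrative assumptions, not part of the survey.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus; in practice the documents would be emails, web pages, etc.
docs = [
    "stock market trading price shares investors",
    "market investors trading economy growth",
    "soccer match goal team player league",
    "team player coach league season match",
]

# Bag-of-words representation of the documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per discovered topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
```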

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing such models [127].
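A minimal sketch of the PageRank idea referenced above is given below: it iterates the usual power method on a tiny hand-made link graph (the graph, damping factor, and iteration count are illustrative assumptions).

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:          # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Tiny illustrative link graph
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))  # page C should receive the highest score
```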

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, while Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; useful knowledge can be extracted from it and its semantics understood through analysis. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smooth measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
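As a rough illustration of static, key-frame-based summarization, the sketch below keeps a frame whenever its gray-level histogram differs markedly from the last kept frame; the OpenCV-based reading loop, the video path, and the threshold are assumptions made for the example, not a method proposed in the surveyed work.

```python
import cv2
import numpy as np

def key_frames(video_path, threshold=0.4):
    """Return indices of frames whose histogram distance to the last
    kept frame exceeds `threshold` (a naive static summarization)."""
    cap = cv2.VideoCapture(video_path)
    kept, last_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist, _ = np.histogram(gray, bins=64, range=(0, 256))
        hist = hist / max(hist.sum(), 1)           # normalize to a distribution
        if last_hist is None or np.abs(hist - last_hist).sum() / 2 > threshold:
            kept.append(idx)                        # total-variation distance test
            last_hist = hist
        idx += 1
    cap.release()
    return kept

# Hypothetical usage:
# print(key_frames("lecture.mp4", threshold=0.3))
```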

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then refined with related feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests and recommend other contents with similar features to the users. These methods largely rely on content similarity measurement, but most of them are troubled by limited analysis capability and over-specification. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
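A minimal sketch of the collaborative-filtering idea follows: it scores unseen items for a user from the ratings of users with similar rating vectors. The rating matrix and the user/item identities are made up purely for illustration.

```python
import numpy as np

def recommend(ratings, user, top_n=2):
    """User-based collaborative filtering on a small ratings matrix.
    ratings: 2-D array, rows = users, columns = items, 0 = not rated."""
    R = np.asarray(ratings, dtype=float)
    target = R[user]
    # Cosine similarity between the target user and every other user
    norms = np.linalg.norm(R, axis=1) * (np.linalg.norm(target) or 1.0)
    sims = R @ target / np.where(norms == 0, 1.0, norms)
    sims[user] = 0.0
    # Predicted score for each item = similarity-weighted average of ratings
    scores = sims @ R / (sims.sum() or 1.0)
    scores[target > 0] = -np.inf              # do not re-recommend rated items
    return np.argsort(scores)[::-1][:top_n].tolist()

# Rows: users, columns: video items (0 means "not watched/rated yet")
ratings = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
]
print(recommend(ratings, user=1))  # the top suggestion is item 3, liked by a similar user
```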

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always focused on link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers to predict the future link [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similarity matrix [141]. A community is represented by a sub-graph, in which edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
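To illustrate the simplest feature-based view of link prediction, the sketch below ranks non-adjacent vertex pairs in a toy graph by the Jaccard coefficient of their neighborhoods, using NetworkX; the example graph itself is an assumption made for the sketch.

```python
import networkx as nx

# Toy friendship graph; vertices are users, edges are existing relations
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("carol", "dave"), ("dave", "erin"),
])

# Score every non-adjacent pair by the Jaccard coefficient of their neighbor sets
candidates = nx.jaccard_coefficient(G)
ranked = sorted(candidates, key=lambda t: t[2], reverse=True)

for u, v, score in ranked[:3]:
    print(f"predicted link {u}-{v}: score {score:.2f}")
```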

Many methods for community detection have been proposed and studied, most of which are topology-based, relying on target functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, as does Twitter with trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and such communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, on marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. On sales planning, after comparison of massive data, enterprises can optimize their commodity prices. On operation, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. On supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
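The drop-out warning model described above is essentially a churn classifier; a minimal sketch of that idea is shown below with scikit-learn, where the feature columns, the training data, and the 20 % cutoff are illustrative assumptions rather than CMB's actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per customer: [months_inactive, balance_trend, card_level]
X_train = np.array([[0, 0.2, 2], [6, -0.5, 1], [1, 0.1, 2],
                    [8, -0.7, 1], [2, 0.0, 2], [7, -0.4, 1]])
y_train = np.array([0, 1, 0, 1, 0, 1])   # 1 = customer eventually dropped out

model = LogisticRegression().fit(X_train, y_train)

# Score the current customer base and pick the riskiest 20 % for retention offers
X_current = np.array([[5, -0.3, 1], [0, 0.3, 2], [9, -0.6, 1], [1, 0.0, 2], [6, -0.2, 1]])
risk = model.predict_proba(X_current)[:, 1]
cutoff = int(np.ceil(0.2 * len(X_current)))
target_customers = np.argsort(risk)[::-1][:cutoff]
print("offer retention products to customers:", target_customers.tolist())
```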

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between the Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills, due to timely identifying and fixing water pipes that were running and leaking this year.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared space, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions, including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications. Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications. In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure with close relations among internal individuals but loose external relations is also called a community. Community-based analysis is of vital importance to improve information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011, to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
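Aspect 1) above, detecting sharp growth or drop in the volume of a topic, can be illustrated with a simple rolling-statistics rule; the daily tweet counts, the window size, and the threshold below are made-up values for the sketch, not figures from the Global Pulse project.

```python
import pandas as pd

# Hypothetical daily counts of tweets mentioning a topic (e.g., "rice price")
counts = pd.Series([120, 130, 118, 125, 140, 135, 128, 460, 455, 132, 127, 125])

# Rolling baseline computed from the preceding week of observations
baseline = counts.rolling(window=7).mean().shift(1)
spread = counts.rolling(window=7).std().shift(1)

# Flag days whose count deviates from the baseline by more than 3 standard deviations
anomalies = (counts - baseline).abs() > 3 * spread
print(counts[anomalies.fillna(False)])   # should flag the sudden jump to ~460
```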

Fig. 5 Enabling technologies for online social network-oriented big data



Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback against some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extreme personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation



6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing becomes a hot topic. The operation framework of spatial crowdsourcing is as follows: a user may request services and resources related to a specified location; then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that spatial crowdsourcing will be more prevailing than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
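A toy sketch of the matching step in that framework is given below: each location-bound task is greedily assigned to the closest willing worker. The task and worker coordinates, and the one-task-per-worker rule, are assumptions made only for illustration.

```python
import math

# Hypothetical spatial tasks (latitude, longitude) and willing mobile workers
tasks = {"t1": (31.23, 121.47), "t2": (31.30, 121.50)}
workers = {"w1": (31.24, 121.48), "w2": (31.29, 121.52), "w3": (31.20, 121.40)}

def assign(tasks, workers):
    """Greedily send each requested task to the closest still-unassigned worker."""
    free = dict(workers)
    plan = {}
    for task_id, t_loc in tasks.items():
        if not free:
            break
        best = min(free, key=lambda w: math.dist(free[w], t_loc))
        plan[task_id] = best
        del free[best]              # each worker handles one task in this sketch
    return plan

print(assign(tasks, workers))       # e.g. {'t1': 'w1', 't2': 'w2'}
```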

6.3.6 Smart grid

Smart grid is the next generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges on exploiting big data.

– Grid planning. By analyzing data in the smart grid, the regions can be identified that have excessively high electrical loads or high power outage frequencies. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at that moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation on the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption. An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal billing sketch under such time-of-use prices is given after this list).



– The access of intermittent renewable energy. At present, much new energy, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
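As referenced in the list above, the following sketch computes a bill from 15-minute smart-meter readings under a simple peak/off-peak tariff; the price levels and the peak window are hypothetical, not TXU Energy's actual rates.

```python
from datetime import datetime, timedelta

# Hypothetical time-of-use tariff (USD per kWh)
PEAK_PRICE, OFFPEAK_PRICE = 0.25, 0.10
PEAK_HOURS = range(14, 20)            # assume 2 pm - 8 pm is the peak window

def bill(readings, start):
    """readings: kWh consumed in consecutive 15-minute intervals starting at `start`."""
    total = 0.0
    t = start
    for kwh in readings:
        price = PEAK_PRICE if t.hour in PEAK_HOURS else OFFPEAK_PRICE
        total += kwh * price
        t += timedelta(minutes=15)
    return total

# One hour of off-peak plus one hour of peak consumption
readings = [0.3, 0.3, 0.3, 0.3,       # 13:00 - 14:00 (off-peak)
            0.5, 0.5, 0.5, 0.5]       # 14:00 - 15:00 (peak)
print(bill(readings, datetime(2014, 1, 6, 13, 0)))  # 1.2*0.10 + 2.0*0.25 = 0.62
```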

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. First, we introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data. There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data. An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to measure the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after the system is implemented and deployed, which makes it difficult to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

– Evolution of big data computing modes. This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data. Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data. As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which refers to wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management. The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented databases and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data. As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets, rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown to be not applicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy. Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data in the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism. Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security. Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through the analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention of researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We could not predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are friendly displayed may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data was becoming more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers that promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it could not replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain, rather than a substitute of the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82 McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83 Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84 Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85 DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special Report on Personal Technology (2011)

116 Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences


Table 1 Comparison of MPI, MapReduce, and Dryad

Deployment. MPI: computing node and data storage arranged separately (data must be moved to the computing node). MapReduce: computing and data storage arranged at the same node (computing should be close to data). Dryad: computing and data storage arranged at the same node (computing should be close to data).
Resource management / scheduling. MPI: –. MapReduce: Workqueue (Google), HOD (Yahoo). Dryad: not clear.
Low-level programming. MPI: MPI API. MapReduce: MapReduce API. Dryad: Dryad API.
High-level programming. MPI: –. MapReduce: Pig, Hive, Jaql, etc. Dryad: Scope, DryadLINQ.
Data storage. MPI: the local file system, NFS, etc. MapReduce: GFS (Google), HDFS (Hadoop), KFS, Amazon S3, etc. Dryad: NTFS, Cosmos DFS.
Task partitioning. MPI: user manually partitions the tasks. MapReduce: automatic. Dryad: automatic.
Communication. MPI: messaging, remote memory access. MapReduce: files (local FS, DFS). Dryad: files, TCP pipes, shared-memory FIFOs.
Fault tolerance. MPI: checkpoint. MapReduce: task re-execution. Dryad: task re-execution.
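To make the MapReduce column of Table 1 concrete, the following is a minimal, framework-free sketch of the map/reduce programming model in Python. The word count over log lines is a hypothetical example; a real cluster job would use Hadoop's Java MapReduce API or one of the high-level languages listed in the table.

# Minimal sketch of the map/reduce model (hypothetical word count over log lines).
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit one (key, value) pair per word.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # Group pairs by key and aggregate the values.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

log_lines = ["error disk full", "warning disk slow", "error network down"]
counts = reduce_phase(chain.from_iterable(map_phase(l) for l in log_lines))
print(counts)   # e.g., {'error': 2, 'disk': 2, 'full': 1, ...}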

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in E-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay (a small sketch follows this list). The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without stringent requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a dedicated platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.
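As a simple illustration of the low-latency requirement of real-time analysis, the sketch below maintains a sliding-window average over an incoming event stream; the window size and the price values are illustrative assumptions, not part of the original survey.

# Sketch of a streaming metric updated with a very short delay:
# a sliding-window average over incoming events.
from collections import deque

class SlidingWindowAverage:
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

monitor = SlidingWindowAverage(size=3)
for price in [10.0, 10.5, 11.0, 12.0]:
    print(monitor.update(price))   # average over the last 3 observations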

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture (see the sketch after this list). With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. Currently, mainstream BI products provide data analysis plans that support data scales over the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.
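The memory-level case mentioned above can be illustrated with MongoDB's aggregation pipeline. The sketch below assumes a locally running MongoDB instance and a hypothetical orders collection; database, collection, and field names are illustrative only.

# Sketch of memory-level analysis with MongoDB (assumes a local MongoDB
# instance and a hypothetical "orders" collection).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Aggregate total sales per region inside the data store, so that hot data
# stays in memory close to the computation.
pipeline = [{"$group": {"_id": "$region", "total": {"$sum": "$amount"}}}]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["total"])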


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly according to the kind of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
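For an analysis that is amenable to parallel processing, the partition-then-combine pattern can be sketched with Python's multiprocessing module as follows; the per-partition computation and the data partitions are placeholders for illustration.

# Sketch of a parallel processing model: each data partition is processed
# independently and the partial results are combined afterwards.
from multiprocessing import Pool

def analyze_partition(partition):
    # Placeholder per-partition computation (here: a sum).
    return sum(partition)

if __name__ == "__main__":
    partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    with Pool(processes=3) as pool:
        partials = pool.map(analyze_partition, partitions)
    print(sum(partials))   # combine the partial results: 45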

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of "What Analytics, Data mining, Big Data software you used in the past 12 months for a real project?" of 798 professionals conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared with S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was used more frequently than R (which ranked first in 2012). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes built from various operators. The entire flow can be viewed as a production line in a factory, with raw data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to extend KNIME; developers can effortlessly add new nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, web data analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of Commercial Applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity for supporting location-aware, people-oriented, and context-aware operations.

– Evolution of Network Applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and to building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, a wealth of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, provide users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operable analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar underlying technologies. Since data analysis has a broad scope and it is not easy to give comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields

6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].
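As a small example of statistical analysis on structured data, the following sketch flags anomalous records with a z-score rule; the sensor readings and the threshold are illustrative, and production anomaly detectors such as those in [117] use far richer statistical machine learning models.

# Sketch of z-score anomaly detection: flag values that deviate from the
# mean by more than k standard deviations.
import statistics

def z_score_anomalies(values, k=3.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > k]

readings = [21.0, 21.4, 20.9, 21.2, 35.7, 21.1]   # one suspicious reading
print(z_score_anomalies(readings, k=2.0))          # [35.7]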

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business-relevant potential than structured data analysis. Generally, text analysis is the process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
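A basic building block of such text mining pipelines is turning unstructured documents into a weighted term matrix that classification or clustering can consume. The sketch below does this with scikit-learn's TfidfVectorizer; the library choice and the toy documents are assumptions for illustration, not part of the original survey.

# Sketch: convert raw documents into a TF-IDF term matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)        # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())     # learned vocabulary
print(tfidf.toarray().round(2))               # weighted term matrix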

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., for email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based search.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on the topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127].
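The idea behind PageRank can be sketched in a few lines as a power iteration of the random-surfer update over a hyperlink graph; the damping factor, the iteration count, and the three-page toy web below are illustrative assumptions.

# Sketch of PageRank as a power iteration over a toy hyperlink graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Rank flowing into p from every page q that links to it.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(web))   # page C accumulates the highest rank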

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; through analysis, useful knowledge is extracted and the semantics are understood. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. Based on the result of structural analysis, the second procedure is feature extraction, which mainly involves further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is optimized through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of their interests, and recommend other contents with similar features to the users. These methods rely largely on content similarity measurement, but most of them are troubled by limited analysis and over-specification. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Recently, a hybrid method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
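A minimal sketch of the collaborative-filtering idea follows: predict a user's preference for an unseen video from the ratings of users with similar taste, measured by cosine similarity. The toy rating dictionary and user names are assumptions made purely for illustration.

# Sketch of user-based collaborative filtering with cosine similarity.
import math

ratings = {
    "alice": {"video1": 5, "video2": 3, "video3": 4},
    "bob":   {"video1": 4, "video2": 3, "video4": 5},
    "carol": {"video2": 1, "video3": 2, "video4": 4},
}

def cosine(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm

def predict(user, item):
    # Weighted average of other users' ratings for the item.
    num, den = 0.0, 0.0
    for other, prefs in ratings.items():
        if other != user and item in prefs:
            w = cosine(ratings[user], prefs)
            num, den = num + w * prefs[item], den + w
    return num / den if den else None

print(predict("alice", "video4"))   # predicted preference for an unseen video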

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers to predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph, in which edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
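As a concrete example of a simple link-prediction signal, the sketch below ranks non-adjacent user pairs by the Jaccard similarity of their neighbor sets; a feature-based classifier, as discussed above, would typically use such scores among its input features. The toy graph and user identifiers are illustrative assumptions.

# Sketch of a link-prediction score: the more neighbours two unconnected
# users share, the more likely a future link between them.
graph = {
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3", "u4"},
    "u3": {"u1", "u2"},
    "u4": {"u2"},
}

def jaccard_score(a, b):
    shared = graph[a] & graph[b]
    union = graph[a] | graph[b]
    return len(shared) / len(union) if union else 0.0

# Score every non-adjacent pair and rank the candidate future links.
pairs = [(a, b) for a in graph for b in graph if a < b and b not in graph[a]]
for a, b in sorted(pairs, key=lambda p: jaccard_score(*p), reverse=True):
    print(a, b, round(jaccard_score(a, b), 2))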

Many methods for community detection have been proposed and studied, most of which are topology-based, relying on target functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and deduction models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144-146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, as does Twitter with trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge and information among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., WeChat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and then start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such health data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's paces when they walk and uses the pace information to unlock a security system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from, and is mainly used in, enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In supply chain management, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
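A customer drop-out (churn) warning model of the kind described above is typically a supervised classifier trained on historical account activity. The sketch below is a minimal, hypothetical illustration using scikit-learn's logistic regression; the feature names, toy data, and the 20% cut-off are assumptions for demonstration, not CMB's actual model.

```python
# Hypothetical churn-warning sketch; features and data are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features per customer: [months_active, monthly_logins, score_exchanges]
X = np.array([[24, 12, 5], [3, 1, 0], [36, 20, 8], [6, 2, 1],
              [18, 9, 3], [2, 0, 0], [30, 15, 6], [4, 1, 0]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 1 = customer dropped out

model = LogisticRegression().fit(X, y)

# Score current customers and flag the top 20% most at risk for retention offers.
current = np.array([[5, 1, 0], [28, 14, 6], [7, 3, 1], [2, 1, 0], [20, 10, 4]])
risk = model.predict_proba(current)[:, 1]          # probability of drop-out
cutoff = np.quantile(risk, 0.8)                    # top 20% threshold
flagged = np.where(risk >= cutoff)[0]
print("at-risk customer indices:", flagged, "risk scores:", risk.round(2))
```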

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to grant loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3% bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project based on the cooperation between the Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills that year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, microblogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogues on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
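Abnormal-event detection of the kind used in aspect 1) can be approximated by flagging days whose topic volume deviates sharply from a recent moving baseline. The sketch below is a simplified illustration on made-up daily tweet counts; the window size and the three-sigma threshold are assumptions, not Global Pulse's actual methodology.

```python
# Simplified sketch: flag sharp growth or drop in a topic's daily tweet volume.
# The data, window, and threshold are illustrative assumptions.
import statistics

def detect_spikes(daily_counts, window=7, n_sigmas=3.0):
    """Return indices of days whose count deviates strongly from the
    mean of the preceding `window` days."""
    anomalies = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline) or 1.0   # avoid division by zero
        z = (daily_counts[i] - mean) / stdev
        if abs(z) >= n_sigmas:
            anomalies.append((i, daily_counts[i], round(z, 1)))
    return anomalies

if __name__ == "__main__":
    rice_tweets = [120, 130, 125, 118, 140, 135, 128,   # quiet week
                   132, 127, 450, 138, 126, 131]        # one abnormal spike
    print(detect_spikes(rice_tweets))                   # flags day 9
```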

Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early warning: rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time monitoring: provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time feedback: acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extreme personalized treatment plan to assess the risk factors and main treatment plans of patients. With such plans, doctors may reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose five pounds of weight, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20%.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through a software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules or employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecast that spatial crowdsourcing will be more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
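A core step in this framework is matching each location-bound task to a nearby willing worker. The sketch below is a minimal greedy assignment over straight-line distances; the coordinates, worker names, and the one-task-per-worker rule are illustrative assumptions rather than a description of any deployed platform.

```python
# Illustrative spatial crowdsourcing assignment: greedily send the nearest
# available worker to each task location. All data here is made up.
import math

def distance(p, q):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def assign_tasks(tasks, workers):
    """tasks: {task_id: (x, y)}; workers: {worker_id: (x, y)}.
    Each worker takes at most one task; returns {task_id: worker_id}."""
    available = dict(workers)
    assignment = {}
    for task_id, task_loc in tasks.items():
        if not available:
            break
        nearest = min(available, key=lambda w: distance(available[w], task_loc))
        assignment[task_id] = nearest
        del available[nearest]            # that worker is now busy
    return assignment

if __name__ == "__main__":
    tasks = {"photo_market": (1.0, 2.0), "video_station": (5.0, 5.0)}
    workers = {"alice": (0.5, 1.5), "bob": (4.0, 6.0), "carol": (9.0, 9.0)}
    print(assign_tasks(tasks, workers))   # {'photo_market': 'alice', 'video_station': 'bob'}
```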

6.3.6 Smart grid

Smart grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). The smart grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the smart grid, the regions that have excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as indicated in the map.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has widely deployed smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (see the sketch after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
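As a toy illustration of the time-sharing dynamic pricing mentioned in the second item above, the sketch below bills a day of 15-minute smart-meter readings at different peak and off-peak rates; the rate values and peak window are invented assumptions, not TXU Energy's actual tariff.

```python
# Toy time-sharing (time-of-use) pricing over 15-minute smart-meter readings.
# The rates and the peak window are invented for illustration.

PEAK_HOURS = range(17, 21)       # assumed peak window: 17:00-20:59
PEAK_RATE = 0.30                 # assumed price per kWh during peak
OFFPEAK_RATE = 0.10              # assumed price per kWh off-peak

def daily_bill(readings_kwh):
    """readings_kwh: 96 consumption values, one per 15-minute interval."""
    total = 0.0
    for interval, kwh in enumerate(readings_kwh):
        hour = (interval * 15) // 60
        rate = PEAK_RATE if hour in PEAK_HOURS else OFFPEAK_RATE
        total += kwh * rate
    return round(total, 2)

if __name__ == "__main__":
    # Flat 0.2 kWh per interval except a heavier 0.8 kWh evening peak.
    readings = [0.8 if 17 <= (i * 15) // 60 < 21 else 0.2 for i in range(96)]
    print("daily cost:", daily_bill(readings))
```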

7 Conclusion, open issues, and outlook

In this paper, we reviewed the background and state-of-the-art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. We then focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after a system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust: this refers to the wrong data collected during acquisition; in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented databases and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volumes grow rapidly, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data; the theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, scalable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it can be observed that the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers that promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11
6 Yuki N (2011) Following digital breadcrumbs to big data gold

httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc
91 Judd D (2008) Hypertable-0.9.0.4-alpha
92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc
93 Crockford D (2006) The application/json media type for javascript object notation (json)
94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc
95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc
96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special report on personal technology (2011)
116 Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on world wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
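To make this point concrete, the following minimal Python sketch (not from the original survey; the partition layout and the statistic computed are invented for illustration) splits a dataset into partitions, analyzes each partition in a separate worker process, and merges the partial results — the kind of embarrassingly parallel workload the paragraph alludes to.

```python
from multiprocessing import Pool

def analyze_partition(partition):
    """Per-partition analysis: here simply a record count and a sum of the values."""
    values = [record["value"] for record in partition]
    return len(values), sum(values)

def parallel_mean(partitions, workers=4):
    """Run the per-partition analysis in parallel and merge the partial results."""
    with Pool(processes=workers) as pool:
        partials = pool.map(analyze_partition, partitions)
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_count if total_count else float("nan")

if __name__ == "__main__":
    # Toy partitions standing in for blocks of a large dataset.
    partitions = [[{"value": i} for i in range(p * 100, (p + 1) * 100)] for p in range(8)]
    print(parallel_mean(partitions))
```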

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software tools, according to the survey "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" of 798 professionals conducted by KDnuggets in 2012 [112].

– R (30.7 %). R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code written in C, C++, or Fortran may be called from the R environment. In addition, skilled users can directly call R objects from C. R is in fact an implementation of the S language, an interpreted language developed by AT&T Bell Labs for data exploration, statistical analysis, and plotting. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in the 2012 survey "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %). Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. Some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis, are installed along with Excel, but such plug-ins can be used only after users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %). RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In the KDnuggets investigation of 2011, it was used more frequently than R and ranked first. Data mining and machine learning functions provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be regarded as a production line in a factory, with raw data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %). KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, which makes it adaptive to distributed environments and independent development. In addition, KNIME is easy to extend; developers can effortlessly add new nodes and views to KNIME.

– Weka/Pentaho (14.8 %). Weka, short for Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, i.e., all aspects of BI. Weka's data processing algorithms are also integrated into Pentaho and can be called directly.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically involves large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in past decades. For example, BI became a prevailing technology for business applications as early as the 1990s, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications. The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require a considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications. The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, a wealth of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications. Scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, for example in astrophysics, oceanology, genomics, and environmental research. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operable analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to cover it comprehensively, we will focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMSs, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].
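As a toy illustration of statistical anomaly detection on a data stream (a minimal z-score sketch with invented data, not the specific method of [117]), the following Python snippet flags points that deviate strongly from a trailing window of recent observations.

```python
import statistics

def zscore_anomalies(stream, window=50, threshold=3.0):
    """Flag points deviating from the trailing-window mean by more than `threshold` std devs."""
    history = []
    anomalies = []
    for i, x in enumerate(stream):
        if len(history) >= window:
            recent = history[-window:]
            mu = statistics.fmean(recent)
            sigma = statistics.pstdev(recent)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                anomalies.append((i, x))
        history.append(x)
    return anomalies

if __name__ == "__main__":
    stream = [10.0] * 200
    stream[120] = 55.0  # injected anomaly
    print(zscore_anomalies(stream))  # -> [(120, 55.0)]
```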

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-oriented potential than structured data. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
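To make one of these techniques concrete, the sketch below fits a small topic model with scikit-learn (assuming scikit-learn is installed; the four-document corpus is invented, and a real system would use far more text and preprocessing).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the stock market fell as investors sold shares",
    "the team won the match with a late goal",
    "shares and bonds rallied on the earnings report",
    "the coach praised the players after the game",
]

# Bag-of-words representation of the documents.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# Fit a two-topic LDA model and print the top words per topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_words}")
```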

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. Research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127].
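Since PageRank [125] is mentioned above, a minimal power-iteration sketch in Python (toy four-page graph, not the production algorithm of any search engine) may help fix ideas about how link structure is turned into page scores.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outgoing:
                    new_rank[q] += damping * rank[p] / len(outgoing)
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(toy_web))  # page C accumulates the highest rank
```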

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by exploiting the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly images, audio, and videos) has been growing at an amazing speed, and analysis is applied to extract useful knowledge and understand its semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo!, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
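A crude sketch of the static key-frame idea is given below (frames are random arrays standing in for decoded video frames; a real summarizer would decode an actual video and use much stronger visual features than raw pixel differences).

```python
import numpy as np

def select_key_frames(frames, top_k=3):
    """Pick the frames whose pixel content differs most from the previous frame."""
    diffs = [0.0]
    for prev, cur in zip(frames[:-1], frames[1:]):
        diffs.append(float(np.mean(np.abs(cur.astype(float) - prev.astype(float)))))
    order = np.argsort(diffs)[::-1]          # largest frame-to-frame change first
    return sorted(int(i) for i in order[:top_k])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8) for _ in range(20)]
    print(select_key_frames(frames))
```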

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users in conveniently and quickly looking up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video content and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized by relevance feedback.

Multimedia recommendation aims to recommend specific multimedia content according to users' preferences. It has proven to be an effective approach for providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of their interests and recommend other content with similar features. These methods rely largely on content similarity measurement, but most of them suffer from limited analysis and over-specification. Collaborative-filtering-based methods identify groups with similar interests and recommend content to group members according to their behavior [132]. More recently, hybrid methods have been introduced, which integrate the advantages of these two types of methods to improve recommendation quality [133].
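As a minimal illustration of the collaborative-filtering idea (invented ratings and a simple user-based cosine similarity; production recommenders are far more elaborate), the sketch below scores unseen items for a user by the similarity-weighted ratings of other users.

```python
import math

ratings = {  # user -> {item: rating}, invented toy data
    "u1": {"clip_a": 5, "clip_b": 3, "clip_c": 4},
    "u2": {"clip_a": 4, "clip_b": 3, "clip_c": 5, "clip_d": 2},
    "u3": {"clip_b": 1, "clip_d": 5},
}

def cosine(r1, r2):
    """Cosine similarity between two users' rating vectors."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(v * v for v in r1.values()))
    n2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (n1 * n2)

def recommend(user, k=2):
    """Score unseen items by similarity-weighted ratings of the other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u1"))  # -> ['clip_d']
```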

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for multimedia event detection using only a few positive training examples. Research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has long been committed to link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to a singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
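A toy sketch of neighbourhood-based link prediction is shown below (a hand-made four-user graph and the Jaccard coefficient as the single similarity feature; real systems such as [139] combine many such features in a trained classifier).

```python
from itertools import combinations

graph = {  # adjacency sets of a small invented social graph
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave": {"bob"},
}

def jaccard(u, v):
    """Similarity of two users' neighbourhoods; higher suggests a future link is more likely."""
    inter = graph[u] & graph[v]
    union = graph[u] | graph[v]
    return len(inter) / len(union) if union else 0.0

# Rank every currently unconnected pair by neighbourhood similarity.
candidates = [(u, v) for u, v in combinations(graph, 2) if v not in graph[u]]
for u, v in sorted(candidates, key=lambda pair: jaccard(*pair), reverse=True):
    print(u, v, round(jaccard(u, v), 2))
```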

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose a more effective large-scale SNS community detection method [143]. Research on SNS evolution aims to look for laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of content in SNS is also considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and likewise Twitter contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. Existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge across different media [150].

6.2.6 Mobile data analysis

By April 2013, Android apps numbered more than 650,000, covering nearly all categories. By the end of 2012, monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just starting, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the recent WeChat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members sit in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (e.g., health, safety, and entertainment) gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for the real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a mechanism for multi-modal analysis of raw sensor data for the real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjøvik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones which analyzes people's gait as they walk and uses the gait information to unlock a security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in accelerometer of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from, and is mainly used in, enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to close the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can become aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year, thanks to the timely identification and fixing of water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and by connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social activities, microblogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the enabling technologies for applications of big data of online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications. Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands may be revealed.

– Structure-based Applications. In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws of social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting a sharp growth or drop in the number of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention to specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
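A minimal sketch of the kind of comparison behind Fig. 6 is given below: correlating a weekly tweet-volume series with a price-index series (both series are invented here purely for illustration, not the Global Pulse data).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented weekly series: tweet counts about rice price vs. a food price index.
tweets_about_rice = [120, 135, 150, 180, 220, 260, 300, 340]
food_price_index = [100.0, 100.5, 101.2, 102.0, 103.1, 104.5, 106.0, 107.8]

print(f"Pearson correlation: {pearson(tweets_about_rice, food_price_index):.2f}")
```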

Generally speaking, the application of big data from online SNS may help us better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire group feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final results into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. At present, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue in mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. Crowd sensing can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as its foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules or employing professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
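The request/collect loop described above can be sketched as follows (a hypothetical, heavily simplified assignment of spatial tasks to the nearest willing workers; the worker names, task names, and coordinates are all invented, and a real platform would handle incentives, quality control, and geodesic distances).

```python
import math

workers = {"w1": (39.90, 116.40), "w2": (39.95, 116.35), "w3": (40.05, 116.50)}  # (lat, lon)
tasks = [
    {"id": "photo_of_intersection", "location": (39.91, 116.41)},
    {"id": "noise_sample", "location": (40.00, 116.48)},
]

def distance(a, b):
    """Planar approximation of distance; a real system would use geodesic distance."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def assign(tasks, workers):
    """Send each sensing task to the closest available worker."""
    assignment = {}
    for task in tasks:
        best = min(workers, key=lambda w: distance(workers[w], task["location"]))
        assignment[task["id"]] = best
    return assignment

print(assign(tasks, workers))
```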

6.3.6 Smart grid

The smart grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for the optimized generation, supply, and consumption of electric energy. Smart-grid-related big data is generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). The smart grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the smart grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at a given moment. It can even compare the power consumption of a block with the average income per capita and with building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted as indicated by the map.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has had several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and, because power utilization data (a source of big data) is frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy used such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can then complement the traditional hydropower and thermal power generation.

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. First, we introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of the display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there are many important problems that remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined, and the existing models have not been strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions for big data applications claim to improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated once a system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions, either before or after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to evaluate data quality effectively and rigorously.

– Evolution of big data computing modes: These include the in-memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which are the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise beyond traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust, which refers to erroneous data collected during acquisition: in big data, not only the correct data but also the erroneous data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. It is therefore worth studying how to integrate provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows rapidly, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy involves two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be acquired more easily, possibly without the users being aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if it was acquired with the permission of users. For example, Facebook is currently deemed the big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to harvest data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality issues and repair some damaged data need to be investigated (a minimal sketch of such automated checks is given at the end of this subsection).

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, potential safety loopholes and APTs (Advanced Persistent Threats) may be discovered by analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and the supporting hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
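As a rough, hypothetical illustration of the automated data quality checks mentioned above (our own sketch, not a method proposed in the surveyed literature), the following Python snippet uses pandas to report completeness, redundancy, and rule-based consistency, and to apply a naive repair; the column names, rules, and thresholds are invented.

import pandas as pd

def quality_report(df, consistency_rules=None):
    # completeness: fraction of non-missing values per column; duplicates indicate redundancy
    report = {"completeness": 1.0 - df.isnull().mean(),
              "duplicate_rows": int(df.duplicated().sum())}
    if consistency_rules:
        # each rule is a boolean predicate over the frame; count violating rows
        report["violations"] = {name: int((~rule(df)).sum())
                                for name, rule in consistency_rules.items()}
    return report

def simple_repair(df):
    # naive repair: drop exact duplicates, fill missing numeric values with the median
    cleaned = df.drop_duplicates()
    numeric_cols = cleaned.select_dtypes("number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    return cleaned

if __name__ == "__main__":
    # hypothetical sensor readings with one missing value and one duplicate row
    df = pd.DataFrame({"sensor_id": [1, 1, 2, 3],
                       "temperature": [21.5, 21.5, None, 85.0]})
    rules = {"temperature_range":
             lambda d: d["temperature"].between(-40, 60) | d["temperature"].isna()}
    print(quality_report(df, rules))
    print(simple_repair(df))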

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's way of living and thinking, which is happening right now. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers are concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, Google's globally-distributed database, and F1, its fault-tolerant and expandable distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through SQL-like grammars.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, it is observed that the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we are willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11
6 Yuki N (2011) Following digital breadcrumbs to big data gold. httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data
7 Yuki N (2011) The search for analysts to make sense of big data. httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media, Inc
91 Judd D (2008) hypertable-0.9.0.4-alpha
92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc
93 Crockford D (2006) The application/json media type for JavaScript object notation (JSON)
94 Murty J (2009) Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc
95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc
96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special report on personal technology (2011)
116 Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically involves large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of commercial applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted through text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity to support location sensing, people-oriented, and context-aware operation.

– Evolution of network applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, a wealth of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of scientific applications: Scientific research in many fields is acquiring massive data with high-throughput sensors and instruments, such as astrophysics, oceanology, genomics, and environmental research. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields

6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, the management and analysis of which rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].
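As a minimal illustration of statistical anomaly detection on structured data (a generic z-score rule, not the specific approach of [117]), the following Python sketch flags values that deviate from the sample mean by more than a chosen number of standard deviations; the data and threshold are invented.

import statistics

def zscore_anomalies(values, threshold=3.0):
    # flag indices whose value lies more than `threshold` standard deviations from the mean
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]

if __name__ == "__main__":
    latencies_ms = [12, 11, 13, 12, 14, 11, 230, 12, 13]  # one obvious outlier
    print(zscore_anomalies(latencies_ms, threshold=2.5))   # expected: [6]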

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
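A common entry point for such text mining tasks is to turn raw text into TF-IDF features and train a simple classifier. The sketch below uses scikit-learn on an invented two-class toy corpus and only illustrates the classification step named above; it is our own example, not a method from the surveyed works.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy corpus: 1 = complaint, 0 = praise
texts = ["the delivery was late and the package damaged",
         "excellent service and fast delivery",
         "terrible support, my issue is still unresolved",
         "great product, works exactly as described"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(stop_words="english")   # bag-of-words weighted by TF-IDF
features = vectorizer.fit_transform(texts)

classifier = MultinomialNB().fit(features, labels)   # simple generative text classifier

new_docs = ["the package arrived damaged again"]
print(classifier.predict(vectorizer.transform(new_docs)))  # expected: [1]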

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of the semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., for email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127].
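The intuition behind PageRank can be sketched in a few lines of plain Python: rank is propagated along hyperlinks and redistributed until it stabilizes. The four-page graph below is invented, and a real implementation would of course operate on a massive, sparse web graph.

def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping a page to the list of pages it links to
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:               # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(web))  # page C should receive the highest score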

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the master Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data is gaining increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and useful knowledge is extracted and the semantics understood through analysis. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then refined through relevance feedback.
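As a toy illustration of the key frame extraction step mentioned above (our own sketch using OpenCV, with a placeholder video path and an arbitrary difference threshold), the following Python snippet keeps a frame whenever it differs sufficiently from the last retained key frame.

import cv2

def extract_key_frames(video_path, diff_threshold=30.0):
    # keep a frame when its mean absolute difference from the last key frame is large
    capture = cv2.VideoCapture(video_path)
    key_frames, last_gray, index = [], None, 0
    while True:
        ok, frame = capture.read()
        if not ok:                                   # end of video (or unreadable file)
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last_gray is None or cv2.absdiff(gray, last_gray).mean() > diff_threshold:
            key_frames.append((index, frame))        # treat this frame as a new key frame
            last_gray = gray
        index += 1
    capture.release()
    return key_frames

if __name__ == "__main__":
    frames = extract_key_frames("sample_video.mp4")  # placeholder path
    print("key frames kept:", len(frames))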

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of their interests, and recommend to users other contents with similar features. These methods rely largely on content similarity measurement, but most of them suffer from limited analysis and excessive specification. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods that integrate the advantages of these two types of methods have been introduced to improve recommendation quality [133].
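A minimal, hypothetical sketch of the collaborative-filtering idea is given below: the users and ratings are invented, and cosine similarity between users is used to weight their ratings when predicting a score for an item the target user has not rated.

import numpy as np

def predict_rating(ratings, target_user, target_item):
    # user-based collaborative filtering with cosine similarity; 0 means "not rated"
    target_vector = ratings[target_user]
    scores, weights = 0.0, 0.0
    for user, vector in enumerate(ratings):
        if user == target_user or vector[target_item] == 0:
            continue
        denom = np.linalg.norm(target_vector) * np.linalg.norm(vector)
        similarity = float(target_vector @ vector) / denom if denom else 0.0
        scores += similarity * vector[target_item]
        weights += similarity
    return scores / weights if weights else 0.0

if __name__ == "__main__":
    # rows: users, columns: multimedia items (e.g. videos); 0 = unrated
    ratings = np.array([[5, 4, 0, 1],
                        [4, 5, 3, 1],
                        [1, 1, 0, 5],
                        [5, 0, 4, 1]], dtype=float)
    print(predict_rating(ratings, target_user=0, target_item=2))  # predicted score for item 2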

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using only a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graphic structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
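A very small sketch of similarity-based link prediction is shown below: for each unconnected pair of vertexes it scores the pair by the number of common neighbors (a standard baseline, not the specific methods of [139–141]); the example graph is invented.

from itertools import combinations

def common_neighbor_scores(adjacency):
    # adjacency: dict mapping a vertex to the set of its neighbors
    scores = {}
    for u, v in combinations(adjacency, 2):
        if v in adjacency[u]:          # already linked, nothing to predict
            continue
        scores[(u, v)] = len(adjacency[u] & adjacency[v])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    graph = {"alice": {"bob", "carol"},
             "bob":   {"alice", "carol", "dave"},
             "carol": {"alice", "bob", "dave"},
             "dave":  {"bob", "carol"}}
    # the highest-scoring non-edges are the predicted future links
    print(common_neighbor_scores(graph))  # ('alice', 'dave') shares 2 neighbors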

Many methods for community detection have been proposed and studied, most of which are based on topological objective functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media includes text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and so does Twitter with trivial Tweets. Third, SNS are dynamic networks, which vary frequently and quickly and are constantly updated. The existing research on social media analysis is still in its infancy. Considering that SNS contains massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android apps numbered more than 650,000, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile data analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., WeChat). Traditional network communities or SNS communities are short of online interaction among members, and such communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment, etc.) who gather together on a network, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time health monitoring. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjøvik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes people's paces when they walk and uses the pace information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the ages, genders, addresses, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data, by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3% bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profoundly experienced the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills in one year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared space, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in the human society, by virtue of theories and methods from mathematics, informatics, sociology, and management science, etc., from three dimensions including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure with close relations among internal individuals but loose external relations is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011, to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
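A minimal sketch of the first aspect, detecting sharp growth or drop in topic volume, is shown below; the daily counts and the threshold are invented for illustration, and a rolling z-score merely stands in for whatever detector such projects actually use.

```python
import numpy as np

def detect_spikes(daily_counts, window=7, threshold=3.0):
    """Flag days whose topic volume deviates sharply from the recent past.

    daily_counts: sequence of daily mention counts for one topic.
    A day is flagged when it lies more than `threshold` standard
    deviations away from the mean of the preceding `window` days.
    """
    counts = np.asarray(daily_counts, dtype=float)
    alerts = []
    for day in range(window, len(counts)):
        history = counts[day - window:day]
        mu, sigma = history.mean(), history.std()
        if sigma == 0:
            continue
        z = (counts[day] - mu) / sigma
        if abs(z) > threshold:
            alerts.append((day, counts[day], round(z, 2)))
    return alerts

# Invented daily counts of Tweets mentioning a rice-price topic
series = [120, 130, 118, 125, 122, 131, 127, 129, 133, 410, 390, 140, 128]
print(detect_spikes(series))
```

The same early-warning idea appears in the first bullet of the list below.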

Generally speaking, the application of big data from online SNS may help us to better understand users' behavior and master the laws of social and economic activities, from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20%.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through its software development kit (SDK) and open interfaces.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing becomes a hot topic. The operation framework of spatial crowdsourcing is as follows. A user may request services and resources related to a specified location. Then, the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that spatial crowdsourcing will be more prevailing than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
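The sketch below illustrates one step of such a framework, assigning each location-based request to the nearest willing worker; the coordinates, names, and greedy assignment rule are assumptions for illustration rather than the scheme used by any particular platform.

```python
import math

# Hypothetical workers willing to participate, with current (x, y) positions
workers = {"w1": (0.0, 0.0), "w2": (5.0, 1.0), "w3": (2.0, 7.0)}

# Hypothetical sensing requests tied to specific locations
requests = [("photo of intersection", (1.0, 1.5)),
            ("noise sample", (4.0, 2.0)),
            ("traffic video", (2.5, 6.0))]

def assign_tasks(workers, requests):
    """Greedily send each spatial task to the closest still-unassigned worker."""
    free = dict(workers)
    plan = []
    for task, loc in requests:
        if not free:
            break
        wid = min(free, key=lambda w: math.dist(free[w], loc))
        plan.append((task, wid, round(math.dist(free[wid], loc), 2)))
        del free[wid]          # each worker handles one task in this sketch
    return plan

for task, worker, distance in assign_tasks(workers, requests):
    print(f"{task} -> {worker} (distance {distance})")
```

Real systems must additionally handle incentives, worker trust, and privacy of the reported locations.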

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control, for optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probabilities can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has deployed smart electric meters with great success, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal time-of-use pricing sketch is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
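As noted in the second item above, here is a minimal time-of-use pricing sketch: the 15-minute readings, peak window, and tariffs are invented, and the simple peak/off-peak rule is only a rough stand-in for the dynamic pricing schemes actually deployed.

```python
# Hypothetical tariffs (currency units per kWh) and peak window
PEAK_RATE, OFF_PEAK_RATE = 0.30, 0.12
PEAK_HOURS = range(17, 21)          # 17:00-20:59 treated as peak

def bill(readings):
    """Price a day of 15-minute smart-meter readings (96 values, in kWh)."""
    total = 0.0
    for slot, kwh in enumerate(readings):
        hour = slot // 4            # four 15-minute slots per hour
        rate = PEAK_RATE if hour in PEAK_HOURS else OFF_PEAK_RATE
        total += kwh * rate
    return round(total, 2)

# Invented load profile: flat 0.2 kWh per slot, heavier use in the evening
profile = [0.2] * 96
for slot in range(17 * 4, 21 * 4):
    profile[slot] = 0.6
print(bill(profile))
```

With frequent meter readings available as big data, the flat peak window could instead be recomputed daily from the observed aggregate load.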

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture for readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there remain many important problems to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many solutions for big data applications claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to measure the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions before and after implementation. In addition, since data quality is an important basis for data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: These include the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise beyond traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized and then mined for more value; (iii) data exhaust, which refers to wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition, since personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is currently deemed the big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capability of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential security loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers are concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, its fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of sciences: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from being algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we are more willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– Simple algorithms on big data are more effective than complex algorithms on small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7 Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98

20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39

60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)

94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC Lehnardt J Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: IEEE international symposium on parallel and distributed processing (IPDPS 2008). IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special report on personal technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM SIGMOD Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Trans Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on world wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and healthcare applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].
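
To make the statistical flavor of such methods concrete, the following toy sketch flags anomalies in a numeric stream with a simple z-score test; the sample data, threshold, and function name are illustrative assumptions rather than the actual models used in [117, 118].

import math

def zscore_anomalies(values, threshold=3.0):
    # Flag samples that deviate from the sample mean by more than
    # `threshold` standard deviations (a simple statistical detector).
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n) or 1.0
    return [(i, x) for i, x in enumerate(values) if abs(x - mean) / std > threshold]

# Example: response-time samples with one obvious outlier
print(zscore_anomalies([12, 11, 13, 12, 14, 11, 95, 12, 13], threshold=2.0))  # [(6, 95)]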

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-oriented potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
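
As a tiny illustration of the statistical side of text mining, the sketch below computes TF-IDF weights to surface characteristic keywords per document; the toy corpus and the bare whitespace tokenizer are simplifying assumptions, and real systems would add the NLP steps listed above.

import math
from collections import Counter

def tf_idf(docs):
    # Compute TF-IDF weights for a list of whitespace-tokenized documents.
    tokenized = [d.lower().split() for d in docs]
    df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
    n_docs = len(docs)
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["big data storage and analysis",
        "text mining extracts knowledge from unstructured text",
        "storage systems for massive data"]
for w in tf_idf(docs):
    print(sorted(w.items(), key=lambda kv: -kv[1])[:3])  # top keywords per document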

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process to discover useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves mining the semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to support more complex queries than keyword-based search.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawling is another successful case of utilizing such models [127].
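
PageRank [125] is compact enough to sketch, so the following power-iteration implementation conveys the idea behind such link-structure models; the damping factor, iteration count, and four-page toy graph are conventional illustrative choices, not details taken from the cited systems.

def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping page -> list of pages it links to.
    # Returns an approximate PageRank score for every page (power iteration).
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for out in outs:
                    new_rank[out] += damping * rank[page] / len(outs)
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(sorted(pagerank(web).items(), key=lambda kv: -kv[1]))  # "c" ranks highest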

Web usage mining aims to mine the auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly images, audio, and videos) have been growing at an amazing speed; multimedia analysis aims to extract useful knowledge and understand the semantics contained in such data. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo!, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and the retrieval result is then refined through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of the content they are interested in, and recommend other contents with similar features. These methods largely rely on content similarity measurement, but most of them suffer from limited analysis capability and over-specification. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods that integrate the advantages of these two types of methods have been introduced to improve recommendation quality [133].
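
A minimal sketch of the collaborative-filtering idea: unseen items are scored for a target user by weighting other users' ratings with user-to-user similarity. The tiny rating matrix and the cosine weighting are illustrative assumptions; content-based and hybrid systems such as [133] are considerably more elaborate.

import math

def user_cosine(u, v):
    # u, v: dicts item -> rating; similarity over co-rated items.
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(target, ratings, top_k=2):
    # ratings: dict user -> {item: rating}. Score items the target has not rated yet.
    sims = {u: user_cosine(ratings[target], r) for u, r in ratings.items() if u != target}
    scores = {}
    for user, sim in sims.items():
        for item, rating in ratings[user].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

ratings = {"alice": {"m1": 5, "m2": 3}, "bob": {"m1": 4, "m3": 5}, "carol": {"m2": 4, "m3": 2, "m4": 5}}
print(recommend("alice", ratings))  # scores m3 and m4 for alice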

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for multimedia event detection using only a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has long been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear-algebra methods compute the similarity between two vertexes from a singular decomposition of the similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
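
One of the simplest instances of similarity-based link prediction scores each unconnected pair of vertexes by the number of neighbors they share; the sketch below implements this common-neighbors score on a toy graph. The feature-based and probabilistic methods cited above are far richer; this is only meant to show the basic idea.

from itertools import combinations

def common_neighbor_scores(adjacency):
    # adjacency: dict node -> set of neighbor nodes (undirected graph).
    # Score every non-adjacent pair by the number of shared neighbors;
    # higher scores suggest more likely future links.
    scores = []
    for u, v in combinations(adjacency, 2):
        if v not in adjacency[u]:
            scores.append(((u, v), len(adjacency[u] & adjacency[v])))
    return sorted(scores, key=lambda kv: -kv[1])

graph = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
print(common_neighbor_scores(graph))  # ('a','d') and ('c','d') each share neighbor 'b'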

Many methods for community detection have been proposed and studied, most of which are topology-based, relying on objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and deduction models that interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
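
As an illustration of topology-based community detection, the sketch below runs naive label propagation, in which each vertex repeatedly adopts the label most frequent among its neighbors until densely connected groups share one label. This is a generic textbook heuristic under illustrative assumptions, not the specific method of [143].

import random

def label_propagation(adjacency, rounds=10, seed=0):
    # adjacency: dict node -> set of neighbors. Returns detected communities
    # as a list of node sets after iterative label updates.
    random.seed(seed)
    labels = {n: n for n in adjacency}
    nodes = list(adjacency)
    for _ in range(rounds):
        random.shuffle(nodes)
        for n in nodes:
            if not adjacency[n]:
                continue
            counts = {}
            for nb in adjacency[n]:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            labels[n] = max(counts, key=counts.get)  # adopt the majority neighbor label
    communities = {}
    for n, lab in labels.items():
        communities.setdefault(lab, set()).add(n)
    return list(communities.values())

g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "x": {"y"}, "y": {"x"}}
print(label_propagation(g))  # expect two communities: {a, b, c} and {x, y}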

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, as Twitter does trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just getting started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and such communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, entertainment, etc.) gather together on networks, meet to make a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing enables people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal analysis mechanism for raw body sensor data for real-time monitoring of health. Under the circumstance that only highly aggregated characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such variably aggregated, multi-source data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones, which analyzes paces when people walk and uses the pace information to unlock the security system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with the built-in seismograph of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and customer satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chain management, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been developing rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the ages, genders, addresses, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3% bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million fewer kilometers.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro-blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.

Fig. 5 Enabling technologies for online social network-oriented big data

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. In this way, doctors may reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if their sugar content is over 20%.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications via a software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, spatial crowdsourcing becomes a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecasted that spatial crowdsourcing will be more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
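
A toy sketch of the worker-selection step in this framework: each spatial task is matched to the nearest willing workers within a travel radius. The haversine distance, the 2 km radius, and the task and worker coordinates are illustrative assumptions, not part of any particular spatial crowdsourcing platform.

import math

def assign_workers(tasks, workers, radius_km=2.0, per_task=3):
    # tasks/workers: dicts mapping id -> (lat, lon). For each spatial task,
    # pick up to `per_task` workers within `radius_km`, closest first.
    def haversine(p, q):
        lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
        a = math.sin((lat2 - lat1) / 2) ** 2 + \
            math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * math.asin(math.sqrt(a))  # Earth radius in km
    plan = {}
    for tid, tpos in tasks.items():
        nearby = sorted((haversine(tpos, wpos), wid) for wid, wpos in workers.items())
        plan[tid] = [wid for d, wid in nearby if d <= radius_km][:per_task]
    return plan

tasks = {"photo_spot": (30.52, 114.36)}
workers = {"w1": (30.53, 114.37), "w2": (30.60, 114.20), "w3": (30.52, 114.35)}
print(assign_workers(tasks, workers))  # w3 and w1 are within roughly 2 km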

6.3.6 Smart grid

Smart grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control, for optimized generation, supply, and consumption of electric energy. Smart-grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the smart grid, the regions with excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past, and the labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users; a toy pricing sketch follows this list.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
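
To make the time-sharing pricing idea above concrete, the sketch below aggregates 15-minute smart meter readings by hour and applies a higher tariff to the busiest hours; the base price, markup, quantile threshold, and sample readings are all illustrative assumptions rather than TXU Energy's actual scheme.

def peak_offpeak_prices(readings, base_price=0.10, peak_markup=1.5, peak_quantile=0.75):
    # readings: list of (hour_of_day, kWh) tuples from 15-minute smart meter data.
    # Aggregate load per hour and mark the busiest hours as "peak", which get a
    # higher tariff: a toy version of time-sharing dynamic pricing.
    load = {}
    for hour, kwh in readings:
        load[hour] = load.get(hour, 0.0) + kwh
    threshold = sorted(load.values())[int(peak_quantile * (len(load) - 1))]
    return {h: base_price * (peak_markup if l >= threshold else 1.0) for h, l in load.items()}

readings = [(7, 1.2), (7, 1.0), (12, 0.4), (12, 0.5), (19, 1.8), (19, 2.0), (23, 0.3)]
print(peak_offpeak_prices(readings))  # hours 7 and 19 priced at the peak rate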

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there are many important problems that remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which are the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; and (iii) data exhaust, which refers to erroneous data collected during acquisition: in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanisms: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention of researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data; the theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, higher diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and by F1, a fault-tolerant, expandable distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

–  Data resource performance. Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

–  Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

–  Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are friendly displayed may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

–  Data-oriented. It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

–  Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

–  During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

–  Compared with accurate data, we are willing to accept numerous and complicated data.

–  We shall pay greater attention to correlations between things rather than exploring causal relationships.

–  The simple algorithms of big data are more effective than the complex algorithms of small data.

–  Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments  This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7 Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media, Inc.

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc.

93 Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)

94 Murty J (2009) Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc.

95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc.

96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special Report on Personal Technology (2011)

116 Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; useful knowledge is extracted and the semantics are understood through analysis. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo!, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and take other smooth measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
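As a rough illustration of the static summarization approach described above, the sketch below keeps a frame as a key frame whenever it differs sufficiently from the previously kept frame. It assumes OpenCV and NumPy are available; the difference metric and threshold are illustrative choices, not the method of TOMS [128].

```python
# Minimal sketch of static video summarization by key-frame selection.
# Assumes OpenCV (cv2) and NumPy are installed; the threshold value is
# illustrative, not taken from any system described in this survey.
import cv2
import numpy as np

def extract_key_frames(video_path, diff_threshold=30.0):
    """Return indices of frames whose mean gray-level difference from the
    previously kept frame exceeds diff_threshold."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            key_frames.append(idx)   # keep this frame as part of the summary
            prev_gray = gray
        idx += 1
    cap.release()
    return key_frames
```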

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including lens boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and the retrieval result is refined through relevance feedback.
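The final query-and-retrieval step can be illustrated with a small sketch that ranks indexed videos by the similarity between their extracted feature vectors and the query features. The feature values, video identifiers, and the choice of cosine similarity are assumptions for illustration rather than a specific system's design.

```python
# Minimal sketch of the query-and-retrieval step: rank indexed videos by the
# similarity of their feature vectors to the query's features.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(query_features, index, top_k=3):
    """index: dict mapping video id -> feature vector (np.ndarray)."""
    scored = [(vid, cosine_similarity(query_features, feats))
              for vid, feats in index.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Toy usage with hypothetical 4-dimensional features.
index = {"clip_a": np.array([0.9, 0.1, 0.3, 0.0]),
         "clip_b": np.array([0.2, 0.8, 0.5, 0.1]),
         "clip_c": np.array([0.85, 0.15, 0.25, 0.05])}
print(retrieve(np.array([0.9, 0.1, 0.3, 0.0]), index))
```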

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by analysis limitations and excessive specifications. The collaborative-filtering-based methods identify groups with similar interests and recommend contents for group members according to their behavior [132]. Presently, mixed methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
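A minimal sketch of the collaborative-filtering idea is given below: it scores unseen items for a target user by weighting other users' ratings with user-to-user similarity. The ratings matrix and the user and item names are hypothetical; production recommenders typically combine this with content-based features, as noted above.

```python
# Minimal sketch of user-based collaborative filtering for multimedia
# recommendation; the ratings, users, and items are hypothetical.
import numpy as np

ratings = {               # user -> {item: rating}
    "alice": {"video1": 5, "video2": 3, "video3": 4},
    "bob":   {"video1": 4, "video2": 2, "video4": 5},
    "carol": {"video2": 5, "video3": 1, "video4": 4},
}

def similarity(u, v):
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    a = np.array([ratings[u][i] for i in common], dtype=float)
    b = np.array([ratings[v][i] for i in common], dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(user, top_k=2):
    seen, scores = set(ratings[user]), {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        for item, r in ratings[other].items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(recommend("alice"))   # items liked by users most similar to alice
```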

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text description related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graphic structures, describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification is to select a group of features for a vertex and utilize the existing link information to generate binary classifiers to predict the future link [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graphic matrix, in which edges connecting vertexes in the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
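The similarity-based flavor of link prediction can be sketched with a common-neighbors score: unconnected user pairs that share many neighbors are predicted to connect in the future. The toy graph below is hypothetical, and real systems would feed such scores into the feature-based classifiers or probabilistic models cited above.

```python
# Minimal sketch of a similarity-based link prediction score: the more common
# neighbors two unconnected vertexes share, the more likely a future link.
graph = {                      # adjacency list: user -> set of friends
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3", "u4"},
    "u3": {"u1", "u2"},
    "u4": {"u2"},
}

def common_neighbors_score(a, b):
    return len(graph[a] & graph[b])

def predict_links(top_k=3):
    candidates, users = [], sorted(graph)
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            if b not in graph[a]:                      # only unlinked pairs
                candidates.append(((a, b), common_neighbors_score(a, b)))
    return sorted(candidates, key=lambda x: x[1], reverse=True)[:top_k]

print(predict_links())   # e.g. (u1, u4) and (u3, u4) share the common neighbor u2
```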

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effect, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth of the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as follows: a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing enables people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, on marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. On sales planning, after comparison of massive data, enterprises can optimize their commodity prices. On operation, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. On supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been rapidly developed. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between the Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills due to timely identifying and fixing water pipes that were running and leaking this year.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in the human society by virtue of theories and methods from mathematics, informatics, sociology, and management science, etc., along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

–  Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

–  Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure with close relations among internal individuals but loose external relations is also called a community. The community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented by applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
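The first of these aspects, detecting abnormal events from sharp changes in topic volume, can be sketched as a simple rolling-statistics test over daily message counts. The counts, window size, and threshold below are illustrative assumptions and do not reproduce the Global Pulse methodology.

```python
# Minimal sketch of abnormal-event detection on a topic's daily tweet counts:
# flag days whose count deviates sharply from the recent rolling mean.
import statistics

def detect_spikes(daily_counts, window=7, n_sigmas=3.0):
    alerts = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0   # avoid division by zero
        z = (daily_counts[day] - mean) / stdev
        if abs(z) > n_sigmas:
            alerts.append((day, daily_counts[day], round(z, 1)))
    return alerts

counts = [120, 130, 125, 118, 140, 135, 128, 131, 620, 129]  # hypothetical
print(detect_spikes(counts))   # day 8 (count 620) is flagged as abnormal
```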

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

–  Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

–  Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

–  Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to conduct coordination with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request services and resources related to a specified location. Then, the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
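The operation framework described above can be read as a simple matching problem between location-based tasks and willing workers. The sketch below is only an illustrative toy, not a system from the survey: it greedily assigns each task to the nearest available worker within an assumed maximum travel radius, with all coordinates and identifiers made up.

```python
import math

def assign_tasks(tasks, workers, max_km=5.0):
    """Greedy spatial-crowdsourcing assignment: each task goes to the
    nearest still-available worker within `max_km` kilometres."""
    def dist(a, b):
        # Rough planar approximation; adequate for city-scale distances.
        dx = (a[0] - b[0]) * 111.0
        dy = (a[1] - b[1]) * 111.0 * math.cos(math.radians(a[0]))
        return math.hypot(dx, dy)

    available = dict(workers)          # worker_id -> (lat, lon)
    assignment = {}
    for task_id, loc in tasks.items():
        best = min(available.items(),
                   key=lambda w: dist(loc, w[1]),
                   default=None)
        if best and dist(loc, best[1]) <= max_km:
            assignment[task_id] = best[0]
            del available[best[0]]     # one worker handles one task here
    return assignment

tasks = {"photo_bridge": (30.52, 114.36), "noise_sample": (30.55, 114.30)}
workers = {"u1": (30.53, 114.35), "u2": (30.56, 114.31), "u3": (30.40, 114.10)}
print(assign_tasks(tasks, workers))   # {'photo_bridge': 'u1', 'noise_sample': 'u2'}
```

A production platform would additionally handle incentives, task deadlines, and one-to-many assignment, none of which are modeled here.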

6.3.6 Smart grid

Smart Grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions can be identified that have excessively high electrical loads or high power outage frequencies. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation on the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a small billing sketch under an assumed time-of-use tariff follows this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
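To make the time-sharing dynamic-pricing point above concrete (the sketch promised in the second challenge), the code below bills one day of 15-minute smart-meter readings under an assumed two-tier time-of-use tariff. The tariff values, peak window, and load profile are invented and are not figures from TXU Energy or the survey.

```python
def time_of_use_bill(readings_kwh, peak_hours=range(18, 22),
                     peak_price=1.2, offpeak_price=0.5):
    """Bill a day of 15-minute energy readings (96 values, in kWh)
    under a simple two-tier time-of-use tariff (prices per kWh)."""
    assert len(readings_kwh) == 96, "expect 96 quarter-hour readings"
    total = 0.0
    for slot, kwh in enumerate(readings_kwh):
        hour = slot // 4                      # 4 quarter-hours per hour
        price = peak_price if hour in peak_hours else offpeak_price
        total += kwh * price
    return round(total, 2)

# Flat 0.2 kWh per quarter-hour, with heavier evening use during peak hours.
profile = [0.2] * 96
for slot in range(18 * 4, 22 * 4):
    profile[slot] = 0.6
print(time_of_use_bill(profile))   # -> 19.52 under the assumed tariff
```

Replacing these fixed parameters with prices derived from measured peak and off-peak load is one way to realize the dynamic pricing the survey alludes to.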

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data acquired during acquisition; in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data from the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We could not predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are friendly displayed may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings are always the source powers to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it could not replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will be increasingly concerned and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives from O'Reilly Radar O'Reilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiber optic communication technologies what's needed for datacenter network operations (2010) IEEE Commun Mag 48(7)32–39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewer's conjecture and the feasibility of consistent available partition-tolerant web services ACM SIGACT News 33(2)51–59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G Lakshman A Pilchin A Sivasubramanian S Vosshall P Vogels W (2007) Dynamo amazon's highly available key-value store In SOSP vol 7 pp 205–220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase the definitive guide O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB the definitive guide O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services S3 EC2 SQS FPS and SimpleDB O'Reilly Media Inc

95 Anderson JC Lehnardt J Slater N (2010) CouchDB the definitive guide O'Reilly Media Inc

96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in mapreduce In Proceedings of the 2010 ACM SIGMOD international conference on management of data ACM pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)

116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iplant collaborative cyberinfrastructure for plant biology Front Plant Sci 34(2)1–16 doi10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update 2012–2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (last accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification is to select a group of features for a vertex and utilize the existing link information to generate binary classifiers to predict the future link [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph, in which edges connecting vertexes in the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
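As an illustration of the feature-based view of link prediction (not the specific classifier of [139]), the sketch below computes two classical neighborhood features, common-neighbour count and Jaccard similarity, for every non-adjacent vertex pair of a toy graph and ranks the candidate links; in practice such features would be fed to a binary classifier. The graph itself is fabricated.

```python
from itertools import combinations

def link_scores(edges):
    """Score non-adjacent vertex pairs by common-neighbour count and
    Jaccard similarity -- typical features for link prediction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = []
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue                       # already linked
        common = adj[u] & adj[v]
        union = adj[u] | adj[v]
        jaccard = len(common) / len(union) if union else 0.0
        scores.append((u, v, len(common), round(jaccard, 2)))
    return sorted(scores, key=lambda s: (-s[2], -s[3]))

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
for pair in link_scores(edges):
    print(pair)   # ('a', 'd') ranks first: two shared friends b and c
```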

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144-146], and some generation methods have been proposed to assist network and system design [147].
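The density-based notion of a community can be made concrete with a small check: for a candidate partition, compare the edge density inside communities with the density across them. The sketch below does exactly that on an invented graph; real detection methods such as [143] search for partitions that score well under topology-based target functions of this kind.

```python
from itertools import combinations

def partition_density(edges, communities):
    """Compare intra-community edge density with cross-community density
    for a given partition of an undirected graph."""
    edge_set = {frozenset(e) for e in edges}
    intra_possible = intra_edges = 0
    for com in communities:
        for u, v in combinations(sorted(com), 2):
            intra_possible += 1
            intra_edges += frozenset((u, v)) in edge_set
    inter_possible = inter_edges = 0
    for c1, c2 in combinations(range(len(communities)), 2):
        for u in communities[c1]:
            for v in communities[c2]:
                inter_possible += 1
                inter_edges += frozenset((u, v)) in edge_set
    return (intra_edges / intra_possible if intra_possible else 0.0,
            inter_edges / inter_possible if inter_possible else 0.0)

edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
dense_in, sparse_out = partition_density(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
print(dense_in, sparse_out)   # 1.0 vs about 0.11: dense inside, sparse across
```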

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relation among individuals, network distances, time effect, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and so do trivial Tweets in Twitter. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contains massive information, transfer learning in heterogeneous networks aims to transfer knowledge information among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis is just starting, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. Mobile communities are defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to make a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing enables people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjøvik University College in Norway and Derawi Biometrics collaborated to develop an application for smartphones which analyzes the pace at which people walk and uses this gait information to unlock the security system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].
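As a rough, hypothetical illustration of this kind of gait analysis (not the Derawi Biometrics implementation), step cadence can be estimated by counting peaks in the accelerometer magnitude; the sampling rate, threshold, and signal below are assumptions.

```python
# A rough, hypothetical sketch of gait analysis on smartphone accelerometer data:
# estimate step cadence by counting peaks in the acceleration magnitude.
# The sampling rate, threshold, and synthetic signal are assumptions.
import math

SAMPLE_RATE_HZ = 50          # assumed accelerometer sampling rate

def step_cadence(samples, threshold=11.0):
    """Estimate steps per second from (x, y, z) accelerometer samples in m/s^2."""
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    steps = 0
    for prev, cur, nxt in zip(magnitudes, magnitudes[1:], magnitudes[2:]):
        if cur > threshold and cur >= prev and cur > nxt:   # a local peak = one step
            steps += 1
    duration_s = len(samples) / SAMPLE_RATE_HZ
    return steps / duration_s if duration_s else 0.0

# Synthetic two-second walk: gravity (~9.8 m/s^2) plus a bump roughly twice per second.
samples = [(0.0, 0.0, 9.8 + 3.0 * max(0.0, math.sin(2 * math.pi * 2 * t / SAMPLE_RATE_HZ)))
           for t in range(2 * SAMPLE_RATE_HZ)]
print("estimated cadence (steps/s):", step_cadence(samples))
```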

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from, and is mainly used in, enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In supply chains, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to close the gap between supply and demand, control budgets, and improve services.
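As a minimal sketch of the marketing and pricing analyses mentioned above, historical sales records can be correlated with promotion spending and price; the column names and figures below are hypothetical.

```python
# A minimal sketch of correlation analysis for marketing and sales planning,
# assuming historical records with hypothetical columns.
import pandas as pd

sales = pd.DataFrame({
    "promotion_spend": [10, 20, 15, 30, 25, 40],          # thousand USD per week
    "unit_price":      [9.9, 9.9, 9.5, 8.9, 9.5, 8.5],
    "units_sold":      [1200, 1900, 1600, 3100, 2400, 4200],
})

# Pairwise Pearson correlations: which levers move sales volume the most?
print(sales.corr()["units_sold"].sort_values(ascending=False))

# A naive price check: average units sold at each price point hints at where
# demand is most sensitive, informing commodity price optimization.
print(sales.groupby("unit_price")["units_sold"].mean())
```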

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective in attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small-business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the ages, genders, addresses, and even hobbies and interests of buyers and sellers. The Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend loans to enterprises based on the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises have perhaps experienced the application of IoT big data most profoundly. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and to optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings benefits in many aspects for Miami-Dade County. For instance, the Department of Park Management of the county saved one million USD in water bills in a single year, owing to timely identification and repair of water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals, based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, microblogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Figure 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
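A minimal sketch of such community detection, assuming the social ties have already been extracted into an edge list, might use greedy modularity maximization from NetworkX; the user names and edges below are synthetic.

```python
# A minimal sketch of community detection on an SNS graph, assuming the
# follower/friend relations have already been extracted into an edge list.
# The node names and edges are synthetic illustrations.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),   # one tight group
         ("dave", "erin"), ("erin", "frank"), ("frank", "dave"),   # another tight group
         ("carol", "dave")]                                        # a weak bridge

G = nx.Graph(edges)

# Greedy modularity maximization groups densely connected users into communities.
communities = greedy_modularity_communities(G)
for i, members in enumerate(communities):
    print(f"community {i}: {sorted(members)}")
```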

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, then get married when they are about 30 years old; finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted a research project that revealed some laws of social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011, to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
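As a minimal sketch of aspect 1) above, a rolling z-score over daily topic counts can flag days with a sharp growth or drop; the counts, window, and threshold below are hypothetical.

```python
# A minimal sketch of aspect 1): flagging abnormal days when the volume of a
# topic (e.g., Tweets mentioning the rice price) grows or drops sharply.
# The daily counts are synthetic; a real pipeline would aggregate them from
# collected Tweets.
import statistics

daily_counts = [120, 132, 128, 125, 130, 127, 122, 119, 410, 126]  # Tweets per day

def abnormal_days(counts, window=5, threshold=3.0):
    """Return (day index, count, z-score) for days deviating sharply from the recent window."""
    flagged = []
    for i in range(window, len(counts)):
        recent = counts[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent) or 1.0   # avoid division by zero
        z = (counts[i] - mean) / stdev
        if abs(z) > threshold:
            flagged.append((i, counts[i], round(z, 1)))
    return flagged

print(abnormal_days(daily_counts))  # expect only day 8 (the spike to 410) to be flagged
```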

Generally speaking, the application of big data from online SNS may help us to better understand user behavior and master the laws of social and economic activities, from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final results into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.
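The prediction task described above is essentially risk classification over laboratory-test features. A minimal sketch, which is not Aetna's actual pipeline and uses synthetic data, might look as follows.

```python
# A minimal, hypothetical sketch of risk classification over laboratory test
# features (triglyceride level, blood sugar, weight change). This is NOT the
# actual Aetna model; all data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic lab-test features: [triglyceride level, blood sugar, weight change]
X = rng.normal(loc=[150.0, 100.0, 0.0], scale=[40.0, 15.0, 5.0], size=(n, 3))
# Synthetic label: higher triglycerides and blood sugar raise the assumed risk.
risk = 0.01 * (X[:, 0] - 150) + 0.03 * (X[:, 1] - 100) + 0.1 * X[:, 2]
y = (risk + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Flag the highest-risk patients for a personalized treatment plan.
high_risk = np.argsort(model.predict_proba(X_test)[:, 1])[-10:]
print("indices of the 10 highest-risk patients:", high_risk)
```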

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. At present, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through a software development kit (SDK) and open interfaces.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not, or do not want to, accomplish. With no need to intentionally deploy sensing modules or employ professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows: a user requests services and resources related to a specified location; then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures); finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional crowdsourcing platforms, e.g., Amazon Mechanical Turk and Crowdflower.
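A minimal sketch of the assignment step in this operation framework, matching a location-based task to the nearest willing workers, is given below; the worker coordinates, radius, and task location are hypothetical.

```python
# A minimal sketch of the assignment step in spatial crowdsourcing: a requester
# posts a task at a location, and the nearest willing workers within a radius
# are selected. Worker names, coordinates, and the radius are hypothetical.
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

workers = {                       # worker -> current (lat, lon)
    "w1": (30.52, 114.35),
    "w2": (30.60, 114.30),
    "w3": (31.23, 121.47),
}

def assign(task_location, workers, radius_km=10.0, k=2):
    """Pick up to k workers closest to the task location within radius_km."""
    nearby = [(haversine_km(task_location, loc), w) for w, loc in workers.items()]
    nearby = sorted(d_w for d_w in nearby if d_w[0] <= radius_km)
    return [w for _, w in nearby[:k]]

task = (30.51, 114.41)            # e.g., "take a photo of the traffic here"
print(assign(task, workers))      # expected: only the worker(s) near the task location
```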

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control, for optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at any given moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may then be conducted.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has deployed smart electric meters with great success; they help the supplier read power utilization data every 15 minutes rather than every month, as in the past. The labor cost for meter reading is greatly reduced and, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to the peak and off-peak periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the Smart Grid can help realize time-sharing dynamic pricing, which is a win-win for both energy suppliers and users; a minimal pricing sketch is given after this list.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages, and such energy resources can complement traditional hydropower and thermal power generation.
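As a minimal sketch of the time-sharing dynamic pricing mentioned in the list above, 15-minute smart-meter readings can be priced by whether they fall in an assumed peak window; the prices, hours, and readings are hypothetical.

```python
# A minimal sketch of time-sharing (time-of-use) dynamic pricing from smart-meter
# readings taken every 15 minutes. The prices, peak window, and readings are
# hypothetical. Each reading is (hour_of_day, kWh consumed in a 15-minute interval).
readings = [(0, 0.2), (7, 0.4), (12, 0.9), (18, 1.3), (19, 1.5), (23, 0.3)]

PEAK_HOURS = range(17, 21)                 # assumed evening peak window
PEAK_PRICE, OFF_PEAK_PRICE = 0.30, 0.10    # assumed $/kWh

def bill(readings):
    """Price each 15-minute interval by whether it falls in the peak window."""
    total = 0.0
    for hour, kwh in readings:
        rate = PEAK_PRICE if hour in PEAK_HOURS else OFF_PEAK_PRICE
        total += kwh * rate
    return round(total, 2)

print("bill for these intervals:", bill(readings))
```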

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. We first introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard or benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions, whether before or after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analytical results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, and the re-organized data can be mined for more value; (iii) data exhaust, which refers to the incorrect data collected during acquisition; in big data, not only the correct data but also the erroneous data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volumes grow fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, acquired data from the public pages of Facebook users who had failed to modify their privacy settings, via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated; a minimal profiling sketch is given after this list.

– Big data safety mechanisms: Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
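As a minimal sketch of the automatic data-quality detection mentioned in the list above, the completeness, redundancy, and consistency dimensions can be profiled with simple checks; the column names and consistency rule below are hypothetical.

```python
# A minimal sketch of automatic data-quality profiling along the dimensions
# named above (completeness, redundancy, consistency). Column names and the
# consistency rule are hypothetical.
import pandas as pd

def profile_quality(df: pd.DataFrame) -> dict:
    report = {}
    # Completeness: fraction of missing cells per column.
    report["missing_rate"] = df.isna().mean().to_dict()
    # Redundancy: fraction of fully duplicated rows.
    report["duplicate_rate"] = float(df.duplicated().mean())
    # Consistency: a simple domain rule (ages must lie in [0, 120]).
    if "age" in df.columns:
        report["age_out_of_range"] = int((~df["age"].dropna().between(0, 120)).sum())
    return report

records = pd.DataFrame({
    "user_id": [1, 2, 2, 4, 5],
    "age": [25, 34, 34, None, 150],
})
print(profile_quality(records))
```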

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's way of living and thinking, which is happening right now. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers are concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of sciences: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data has become increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we are willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and progress in data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90. George L (2011) HBase: the definitive guide. O'Reilly Media, Inc.

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc.

93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc.

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc.

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142 Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web, ACM, pp 631–640

143 Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis, ACM, pp 16–25

144 Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement, ACM, pp 315–321

145 Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement, ACM, pp 145–158

146 Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement, ACM, pp 131–144

147 Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1007–1016

148 Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 807–816

149 Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining, ACM, pp 657–666

150 Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152 Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153 Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, p 2

154 Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval, ACM, pp 469–478

155 Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium, ACM, pp 445–454

156 Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize variably aggregated multi-source healthcare data.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].
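Both applications hinge on extracting a dominant motion frequency from a phone's accelerometer stream. The following Python sketch illustrates only that generic step under assumed values (a synthetic 5 Hz signal sampled at 50 Hz); it is not the implementation used by Derawi Biometrics or iTrem.

import numpy as np

def dominant_frequency(accel, fs):
    # Return the dominant frequency (Hz) of an accelerometer magnitude signal
    # sampled at fs Hz.
    accel = accel - np.mean(accel)              # remove gravity / DC offset
    spectrum = np.abs(np.fft.rfft(accel))       # magnitude spectrum
    freqs = np.fft.rfftfreq(len(accel), d=1.0 / fs)
    return freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin

# Synthetic example: a 5 Hz tremor-like oscillation sampled at 50 Hz for 10 s
fs = 50.0
t = np.arange(0, 10, 1.0 / fs)
signal = np.sin(2 * np.pi * 5.0 * t) + 0.2 * np.random.randn(t.size)
print(dominant_frequency(signal, fs))           # approximately 5.0 Hz

A gait-analysis or tremor-monitoring application would then compare the recovered frequency against user-specific or disease-specific bands.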

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, on marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. On sales planning, after comparison of massive data, enterprises can optimize their commodity prices. On operation, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. On supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been rapidly developed. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
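CMB's drop-out warning model is not publicly documented. As a rough illustration of how such a warning model is commonly built, the sketch below trains a logistic-regression classifier on hypothetical customer features and ranks customers by predicted drop-out risk; all feature names and data are assumptions made for illustration, not CMB's actual method.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features per customer: monthly transactions, score-exchange
# events, and card tier (0/1/2).
X = np.column_stack([
    rng.poisson(20, 1000),      # monthly transaction count
    rng.poisson(3, 1000),       # score-exchange events
    rng.integers(0, 3, 1000),   # card tier
])
# Synthetic label: inactive customers are assumed more likely to drop out.
y = (X[:, 0] + 3 * X[:, 1] + rng.normal(0, 5, 1000) < 20).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank customers by predicted drop-out probability and target the riskiest 20 %.
risk = model.predict_proba(X)[:, 1]
top_20_percent = np.argsort(risk)[::-1][: len(risk) // 5]
print("customers to target with retention offers:", top_20_percent[:10])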

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao, and the corresponding transaction times, commodity prices, and purchase quantities are recorded every day, and, more importantly, along with the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between the Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills, due to timely identifying and fixing water pipes that were running and leaking this year.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an


information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared spaces, etc., which represents various user activities. The analysis of big data from online SNS uses computational analytical methods provided for understanding relations in the human society by virtue of theories and methods, involving mathematics, informatics, sociology, and management science, etc., from three dimensions including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. The community-based analysis is of vital importance to improve information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented by applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011, to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth

Fig. 5 Enabling technologies for online social network-oriented big data


or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
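The first aspect, spotting a sharp growth or drop in topic volume, can be approximated with a rolling z-score over daily counts. The sketch below is a generic illustration under that assumption and uses made-up counts; it is not the method actually used in the Global Pulse project.

import numpy as np

def volume_anomalies(daily_counts, window=7, threshold=3.0):
    # Flag days whose topic volume deviates sharply from the recent past.
    counts = np.asarray(daily_counts, dtype=float)
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = history.mean(), history.std()
        if sigma > 0 and abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily counts of tweets mentioning rice price
counts = [110, 120, 95, 130, 105, 115, 125, 118, 122, 480, 460, 130]
print(volume_anomalies(counts))   # flags the sudden jump at day index 9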

Generally speaking, the application of big data from online SNS may help to better understand user behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extreme personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting patients to reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

Fig. 6 The correlation between Tweets about rice price and food price inflation


imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application with the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to conduct coordination with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
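A minimal sketch of the task-assignment step in this operation framework is given below: a requester posts a task at a location, and the task is offered to the nearest willing participant by great-circle distance. The worker data and the greedy nearest-first policy are illustrative assumptions, not the behavior of any deployed platform.

import math

def haversine_km(p, q):
    # Great-circle distance in kilometres between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def assign_task(task_location, workers):
    # Offer the task to the willing worker closest to the task location.
    willing = [w for w in workers if w["willing"]]
    if not willing:
        return None
    return min(willing, key=lambda w: haversine_km(task_location, w["location"]))

workers = [
    {"id": "w1", "location": (30.52, 114.35), "willing": True},
    {"id": "w2", "location": (30.60, 114.30), "willing": False},
    {"id": "w3", "location": (30.50, 114.40), "willing": True},
]
print(assign_task((30.51, 114.36), workers)["id"])   # nearest willing worker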

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has had several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and, because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price


according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal sketch of such time-of-use pricing appears after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
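The time-sharing dynamic pricing mentioned in the second item can be illustrated with a short sketch: one day of 15-minute smart-meter readings is aggregated per hour, hours whose load exceeds the daily mean are treated as peak hours, and a higher tariff is applied to them. The thresholds and tariffs are hypothetical and are not TXU Energy's actual scheme.

import numpy as np

def time_of_use_prices(readings_kwh, base_price=0.10, peak_markup=1.5):
    # Derive hourly prices from one day of 15-minute smart-meter readings.
    # readings_kwh: 96 values (24 h x 4 readings/h) of aggregate consumption.
    # Hours whose load exceeds the daily mean are priced at a peak tariff.
    hourly_load = np.asarray(readings_kwh, dtype=float).reshape(24, 4).sum(axis=1)
    peak_hours = hourly_load > hourly_load.mean()
    return np.where(peak_hours, base_price * peak_markup, base_price)

# Hypothetical load profile: low at night, higher in the day, peaking in the evening
profile = np.concatenate([np.full(48, 0.3), np.full(32, 0.8), np.full(16, 1.4)])
print(time_of_use_prices(profile).round(3))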

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, there are many important problems that remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more values.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research is advanced, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data during acquisition: in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems should be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented databases and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown to be not applicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as the big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data in the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data


quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
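As a toy illustration of the last point, the sketch below scans hypothetical intrusion-detection log records, counts failed-login events per source address, and flags sources whose failure counts are statistical outliers; real loophole and APT discovery over big log data is of course far more involved, and the log format here is invented.

from collections import Counter

def suspicious_sources(log_lines, sigma=3.0):
    # Flag source IPs with an unusually high number of failed logins.
    failures = Counter(
        line.split()[-1] for line in log_lines if "LOGIN_FAILED" in line
    )
    counts = list(failures.values())
    if len(counts) < 2:
        return []
    mean = sum(counts) / len(counts)
    std = (sum((c - mean) ** 2 for c in counts) / len(counts)) ** 0.5
    return [ip for ip, c in failures.items() if std > 0 and (c - mean) / std > sigma]

# Hypothetical IDS log lines of the form "<timestamp> <event> <source-ip>"
logs = (["2014-01-01T00:00 LOGIN_FAILED 10.0.0.%d" % i for i in range(1, 20)]
        + ["2014-01-01T00:01 LOGIN_FAILED 10.0.0.99"] * 200)
print(suspicious_sources(logs))   # ['10.0.0.99']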

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We could not predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher diversity, and more complexly structured data. These efforts are represented by the globally-distributed database Spanner of Google and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are friendly displayed may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data is becoming more complex, object-oriented design methods are developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it could not replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2 Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3 Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4 Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/big-data-0

5 Lohr S (2012) The age of big data. New York Times, p 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7 Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8 Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9 Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10 Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11 Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95 Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special Report on Personal Technology (2011)
116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences



information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate the relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying big data for predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime patterns, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old; finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.

Fig. 6 The correlation between Tweets about rice price and food price inflation
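The Global Pulse comparison above is, at its core, a correlation between two monthly time series: the count of tweets mentioning rice prices and the official food price inflation index. A minimal sketch of such a comparison is given below; the numbers are made-up placeholders, not the project's data, and the analysis is reduced to a plain Pearson correlation.

    import numpy as np

    # Hypothetical monthly series (placeholders, not Global Pulse data):
    # tweets mentioning rice prices, and official food price inflation (%).
    tweets_about_rice = np.array([120, 135, 150, 210, 340, 520, 610, 580, 430, 300, 220, 180])
    food_inflation = np.array([3.1, 3.2, 3.4, 4.0, 5.2, 6.8, 7.5, 7.1, 6.0, 4.9, 4.1, 3.7])

    # A strong positive coefficient would indicate that the tweet volume
    # tracks the official statistics, as reported for the Indonesian case.
    r = np.corrcoef(tweets_about_rice, food_inflation)[0, 1]
    print(f"Pearson correlation: {r:.2f}")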

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services (a minimal sketch follows this list).

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

– Real-time Feedback: to acquire groups' feedback on social activities based on real-time monitoring.
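A minimal sketch of the early-warning idea, under the assumption that topic volumes arrive as a simple list of hourly counts: flag any count that deviates sharply (here, an arbitrary three standard deviations) from a rolling baseline. This is a generic illustration, not the detection method used by any particular project.

    from collections import deque

    def detect_spikes(counts, window=24, threshold=3.0):
        """Return indices where a topic count jumps or drops sharply relative to
        the mean and standard deviation of the previous `window` values."""
        history = deque(maxlen=window)
        alerts = []
        for i, c in enumerate(counts):
            if len(history) == window:
                mean = sum(history) / window
                std = (sum((x - mean) ** 2 for x in history) / window) ** 0.5
                if std > 0 and abs(c - mean) > threshold * std:
                    alerts.append(i)
            history.append(c)
        return alerts

    # Hourly counts of one topic; the burst at the end should be flagged.
    hourly_counts = [100, 98, 103, 97, 101, 99, 102, 100, 96, 104] * 3 + [450]
    print(detect_spikes(hourly_counts))  # -> [30]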

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment, in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not, or do not want to, accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
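On the platform side, the central step of the framework above is matching a location-based task to nearby willing participants. The sketch below assumes a simple worker list with GPS coordinates and ranks workers by great-circle distance; a real spatial crowdsourcing system would also consider incentives, privacy, and travel constraints.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points in kilometers."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))

    def assign_task(task_location, workers, k=3):
        """Pick the k willing workers closest to the requested location."""
        ranked = sorted(workers, key=lambda w: haversine_km(*task_location, w["lat"], w["lon"]))
        return ranked[:k]

    # Hypothetical task (e.g., collect photos at a spot) and candidate workers.
    task = (30.52, 114.36)
    workers = [
        {"id": "w1", "lat": 30.51, "lon": 114.35},
        {"id": "w2", "lat": 30.60, "lon": 114.40},
        {"id": "w3", "lat": 30.52, "lon": 114.37},
        {"id": "w4", "lat": 31.00, "lon": 114.00},
    ]
    print([w["id"] for w in assign_task(task, workers, k=2)])  # -> ['w3', 'w1']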

6.3.6 Smart grid

Smart Grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal sketch of such meter-data aggregation is given after this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
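As a minimal illustration of the per-block aggregation behind both the "electric map" and time-of-use pricing, the sketch below groups hypothetical 15-minute smart meter readings by block and hour of day and picks each block's heaviest hour as its peak period. The data and column names are assumptions for the example, not an actual AMI schema.

    import pandas as pd

    # Hypothetical 15-minute smart meter readings (block id, timestamp, kWh).
    readings = pd.DataFrame({
        "block": ["A", "A", "A", "B", "B", "B"],
        "timestamp": pd.to_datetime([
            "2014-01-06 08:15", "2014-01-06 19:00", "2014-01-06 19:15",
            "2014-01-06 03:00", "2014-01-06 19:00", "2014-01-06 23:45",
        ]),
        "kwh": [0.8, 2.4, 2.6, 0.3, 1.9, 0.4],
    })

    # Average load per block and hour of day: the raw material for an
    # "electric map" and for choosing peak periods in time-of-use pricing.
    hourly = (readings
              .assign(hour=readings["timestamp"].dt.hour)
              .groupby(["block", "hour"])["kwh"]
              .mean())
    print(hourly)

    # The heaviest hour in each block, i.e., a crude peak period per block.
    print(hourly.groupby(level="block").idxmax())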

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and a big picture for readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to measure the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once a system is implemented and deployed, so it is hard to horizontally compare the advantages and disadvantages of alternative solutions before and after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck of big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems of big data processing arise beyond traditional data analysis, including (i) data re-utilization: as the data scale increases, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; and (iii) data exhaust, which refers to wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems still need to be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented databases and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be not applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of this. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated (a minimal profiling sketch is given after this list).

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
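As a minimal sketch of automatic data-quality detection, the fragment below computes two of the indicators named above, completeness (share of non-missing values per column) and redundancy (share of exact duplicate records), over a toy table; accuracy and consistency checks would need domain-specific rules, and the table and column names are assumptions for the example.

    import pandas as pd

    def profile_quality(df: pd.DataFrame) -> dict:
        """Crude data-quality indicators: completeness and redundancy only."""
        return {
            "rows": len(df),
            # Completeness: fraction of non-missing cells per column.
            "completeness": (1 - df.isna().mean()).round(3).to_dict(),
            # Redundancy: share of exact duplicate records.
            "duplicate_ratio": round(float(df.duplicated().mean()), 3),
        }

    # Toy table with one missing value and one duplicated record.
    records = pd.DataFrame({
        "patient_id": [1, 2, 2, 4],
        "glucose": [5.4, 6.1, 6.1, None],
    })
    print(profile_quality(records))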

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impacts, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future:

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data; the theoretical basis of Hadoop emerged as early as 2006. Many researchers are concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, Google's globally-distributed database, and F1, a fault-tolerant, expandable distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

  – Compared with accurate data, we would be willing to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – The simple algorithms of big data are more effective than the complex algorithms of small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11
6 Yuki N (2011) Following digital breadcrumbs to big data gold. httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications, Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proc VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on (IPDPS 2008), IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing, ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special report on personal technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance, ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers, ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Trans Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval, ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia, ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web, ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis, ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining, ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval, ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium, ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
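To make this kind of trend correlation concrete, a minimal sketch is given below. It is only an illustration, not the project's actual pipeline: the two monthly series are invented placeholders, and the Pearson correlation coefficient is computed directly from its definition.

def pearson(x, y):
    # Pearson correlation coefficient computed from its definition.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical monthly series: Tweets about rice price vs. food price inflation (%).
tweets_rice_price = [120, 150, 170, 260, 310, 400]
food_inflation = [4.1, 4.3, 4.8, 5.9, 6.4, 7.2]
print("Pearson correlation:", round(pearson(tweets_rice_price, food_inflation), 3))

A coefficient close to 1 indicates the two series rise and fall together, which is the kind of relationship the Indonesian rice price example illustrates.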

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services (a minimal sketch follows this list).

– Real-time Monitoring: provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: acquire groups' feedback on some social activities based on real-time monitoring.
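As a minimal illustration of the early-warning idea in the list above, the sketch below flags abnormal points in a usage stream with a simple rolling z-score rule; the window size, threshold, and series are hypothetical choices, not a method prescribed by the surveyed systems.

from statistics import mean, stdev

def early_warnings(series, window=6, threshold=3.0):
    # Flag points that deviate from the recent history by more than `threshold` sigmas.
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            alerts.append((i, series[i]))
    return alerts

usage = [100, 98, 103, 101, 99, 102, 100, 340, 101, 97]  # hypothetical hourly usage counts
print(early_warnings(usage))  # -> [(7, 340)], the abnormal spike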

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20 %.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia Coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application with the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to conduct coordination with mobile networks for distribution of sensing tasks and collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional crowdsourcing, e.g., Amazon Turk and Crowdflower.
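A minimal sketch of this operation framework is given below: a requester posts a task at a location, and nearby mobile users are selected by great-circle distance and ranked by proximity. The coordinates, radius, and worker list are hypothetical and only illustrate the matching step.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points on Earth, in kilometers.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def select_workers(task_lat, task_lon, workers, radius_km=2.0):
    # Keep workers within the radius, closest first.
    ranked = sorted((haversine_km(task_lat, task_lon, lat, lon), name)
                    for name, (lat, lon) in workers.items())
    return [name for dist, name in ranked if dist <= radius_km]

workers = {"w1": (30.515, 114.420), "w2": (30.520, 114.300), "w3": (30.517, 114.425)}
print(select_workers(30.516, 114.423, workers))  # -> ['w3', 'w1']; w2 is too far away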

6.3.6 Smart grid

Smart Grid is the next-generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal sketch of such pricing follows this list).

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
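To illustrate the time-sharing (time-of-use) dynamic pricing mentioned in the second item of the list above, a minimal sketch follows; the tariff levels, peak hours, and 15-minute readings are hypothetical and only show how frequently acquired smart-meter data could be billed against peak and off-peak prices.

PEAK_HOURS = range(18, 22)                 # hypothetical evening peak, 18:00-22:00
PRICE = {"peak": 0.30, "offpeak": 0.12}    # hypothetical prices per kWh

def bill(readings):
    # readings: list of (hour_of_day, kwh_consumed_in_15_minutes)
    total = 0.0
    for hour, kwh in readings:
        tariff = "peak" if hour in PEAK_HOURS else "offpeak"
        total += kwh * PRICE[tariff]
    return round(total, 2)

# One day of 15-minute readings at a flat 1 kWh per hour.
day = [(hour, 0.25) for hour in range(24) for _ in range(4)]
print(bill(day))  # -> 3.6 (20 kWh off-peak + 4 kWh peak)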

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still not a unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare advantages and disadvantages of various alternative solutions even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more values.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data acquired during acquisition: in big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware of this. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data from the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated (a minimal sketch follows this list).

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APT (Advanced Persistent Threat) attacks after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
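As a minimal illustration of the data quality dimensions mentioned in the data quality item above, the sketch below computes simple completeness and redundancy ratios for a set of records; the field names and records are hypothetical, and real quality assessment would of course go far beyond such counts.

def quality_report(records, required_fields):
    # Completeness: share of required fields that are actually filled.
    # Redundancy: share of records that are exact duplicates of another record.
    n = len(records)
    filled = sum(1 for r in records for f in required_fields if r.get(f) not in (None, ""))
    completeness = filled / (n * len(required_fields))
    redundancy = 1 - len({tuple(sorted(r.items())) for r in records}) / n
    return {"completeness": round(completeness, 2), "redundancy": round(redundancy, 2)}

records = [
    {"id": 1, "name": "sensor-a", "value": 20.5},
    {"id": 2, "name": "", "value": 19.8},          # missing name
    {"id": 1, "name": "sensor-a", "value": 20.5},  # exact duplicate
]
print(quality_report(records, ["id", "name", "value"]))  # -> {'completeness': 0.89, 'redundancy': 0.33}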

The safety of big data has drawn great attention of researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is just happening. We could not predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and will be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are friendly displayed may they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source powers that promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it could not replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain rather than a substitute of the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37, ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13), ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Trans Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48, 2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management, ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks, IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems, ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 international conference on (IPSN'08), IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems, ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007. 6th international symposium on (IPSN 2007), IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks, IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems, ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web, ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009. 35th European conference on (ECOC'09), IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39, ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM computer communication review, vol 40, ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems, ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks, ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ, Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management, ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008. IEEE 24th international conference on (ICDE 2008), IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Trans Multimed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Trans Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Trans Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing, ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation, USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing, ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications, Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proc VLDB Endowment 2(2):1414–1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC: Special Report on Personal Technology (2011)

116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology Front Plant Sci 34(2)1–16 doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traffic forecast update 2012–2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (last accessed 5 May 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences


imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application with the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks to distribute sensing tasks and to collect and utilize the sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as its foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not, or would not want to, accomplish. With no need to intentionally deploy sensing modules or employ professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests services and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent back to the service requester. With the rapid growth of mobile devices and their increasingly powerful functions, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower. A minimal sketch of such location-based task assignment is given below.
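The sketch below is a hypothetical illustration of this workflow, not any particular platform's API: a requester posts a task tied to a location, and workers within an assumed radius are selected to collect the data. The coordinates, radius, and field names are invented for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def assign_task(task, workers, radius_km=2.0):
    """Return the workers close enough to the task location to collect data."""
    return [w for w in workers
            if haversine_km(task["lat"], task["lon"], w["lat"], w["lon"]) <= radius_km]

# Example: a requester asks for photos near a road intersection.
task = {"id": "t1", "lat": 30.52, "lon": 114.36, "type": "photo"}
workers = [
    {"id": "w1", "lat": 30.53, "lon": 114.35},   # roughly 1.5 km away, selected
    {"id": "w2", "lat": 30.60, "lon": 114.30},   # too far, skipped
]
print([w["id"] for w in assign_task(task, workers)])  # ['w1']
```

In a real deployment the selected workers would then upload their video, audio, or pictures, which the platform forwards to the requester.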

6.3.6 Smart grid

The Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart-Grid-related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nation-wide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). The Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: by analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified; even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" based on big data and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as the unit to show the current power consumption of every block. It can even compare the power consumption of a block with the average income per capita and the building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city, and preferential upgrading of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed on the map, may be conducted.

– Interaction between power generation and power consumption: an ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor cost for meter reading is greatly reduced, and because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy used such price levels to stabilize the peak and off-peak fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-of-use dynamic pricing, which is a win-win situation for both energy suppliers and users (a minimal pricing sketch is given after this list).

– Access of intermittent renewable energy: at present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions, which feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity they generate can be allocated to regions with electricity shortages, and such energy resources can complement traditional hydropower and thermal power generation.
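As a rough illustration of the time-of-use pricing idea in the power generation and consumption item above, the sketch below aggregates assumed 15-minute smart meter readings into hourly loads, labels hours as peak or off-peak with an assumed threshold, and prices them with two assumed tariff levels; none of these numbers reflect TXU Energy's actual scheme.

```python
from collections import defaultdict

# Illustrative only: synthetic 15-minute smart meter readings (kWh) keyed by
# (hour, quarter). Real AMI data, tariffs, and peak definitions will differ.
readings = {(h, q): 0.4 if 18 <= h <= 21 else 0.15 for h in range(24) for q in range(4)}

PEAK_THRESHOLD_KWH = 1.0                   # hourly usage above this counts as peak load
PRICE = {"peak": 0.20, "offpeak": 0.08}    # assumed tariff levels in $/kWh

hourly = defaultdict(float)
for (hour, _quarter), kwh in readings.items():
    hourly[hour] += kwh                    # aggregate four 15-minute readings per hour

bill = 0.0
for hour, kwh in sorted(hourly.items()):
    period = "peak" if kwh > PEAK_THRESHOLD_KWH else "offpeak"
    bill += kwh * PRICE[period]

print(f"daily consumption: {sum(hourly.values()):.1f} kWh, bill: ${bill:.2f}")
```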

7 Conclusion, open issues, and outlook

In this paper, we have reviewed the background and state-of-the-art of big data. First, we introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. We finally reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area for readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of the display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data: there is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research, because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data: an evaluation system for data quality and an evaluation standard or benchmark for data computing efficiency should be developed. Many solutions for big data applications claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after a system is implemented and deployed, which makes it hard to compare the advantages and disadvantages of alternative solutions horizontally, before or after implementation. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

– Evolution of big data computing modes: these include the in-memory mode, data flow mode, PRAM mode, and MapReduce mode, among others. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value (a minimal conversion sketch is given after this list).


– Big data transfer: big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications, so improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: the real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data: as big data research advances, new problems of big data processing arise beyond traditional data analysis, including (i) data re-utilization: as the data scale increases, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; (iii) data exhaust, i.e., the wrong data collected during acquisition: in big data, not only the correct data but also the wrong data should be utilized to generate more value.
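As referenced in the format-conversion item above, the following minimal sketch maps records arriving in two heterogeneous formats (CSV and JSON) onto one assumed unified schema; the field names and target schema are illustrative only.

```python
import csv, json, io

# Records from two heterogeneous sources are converted to one common schema:
# {"device_id", "ts", "value"}. Both the source field names and the target
# schema are assumptions made for this illustration.

def from_csv(text):
    for row in csv.DictReader(io.StringIO(text)):
        yield {"device_id": row["id"], "ts": row["time"], "value": float(row["reading"])}

def from_json(text):
    for obj in json.loads(text):
        yield {"device_id": obj["sensor"], "ts": obj["timestamp"], "value": float(obj["val"])}

csv_src = "id,time,reading\nA1,2014-01-01T00:00,3.2\n"
json_src = '[{"sensor": "B7", "timestamp": "2014-01-01T00:00", "val": 4.5}]'

unified = list(from_csv(csv_src)) + list(from_json(json_src))
print(unified)
# [{'device_id': 'A1', 'ts': '2014-01-01T00:00', 'value': 3.2},
#  {'device_id': 'B7', 'ts': '2014-01-01T00:00', 'value': 4.5}]
```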

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems still need to be solved.

– Big data management: the emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

– Integration and provenance of big data: as discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and large amounts of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets (a small record-matching sketch follows this list).

– Big data application: at present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction for big data are all important research problems.
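The record-matching sketch below, referenced in the integration item above, illustrates one elementary integration step: deciding whether records from two sources with different schemas describe the same entity. The schemas, similarity measure, and threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Two sources describe organizations with different field names.
source_a = [{"name": "Huazhong Univ. of Science and Technology", "city": "Wuhan"}]
source_b = [{"org": "Huazhong University of Science & Technology", "location": "Wuhan"}]

def similar(x, y, threshold=0.8):
    """Fuzzy string similarity used as a crude duplicate detector."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio() >= threshold

# Match records that name (approximately) the same entity in the same city.
matches = [(a, b) for a in source_a for b in source_b
           if similar(a["name"], b["org"]) and a["city"] == b["location"]]
print(len(matches))  # 1: both records refer to the same institution
```

Real integration pipelines add schema mapping, provenance tracking, and conflict resolution on top of such matching.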

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows rapidly, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, body properties, etc. of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is currently deemed the big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to harvest data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality problems and repair some damaged data need to be investigated (a small quality-check sketch follows this list).

– Big data safety mechanisms: big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security: big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential security loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files from an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be identified more easily through the analysis of big data.
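The sketch below, referenced in the data quality item above, computes three of the quality indicators mentioned there (completeness, redundancy, and consistency) over a tiny assumed record set; the schema and the valid value range are illustrative assumptions.

```python
# Toy sensor records used only to illustrate automatic quality checks:
# completeness (missing fields), redundancy (duplicate records), and
# consistency (values outside an assumed valid range of -40..60 C).
records = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},     # incomplete
    {"id": 1, "temp_c": 21.5},     # duplicate of the first record
    {"id": 3, "temp_c": 999.0},    # inconsistent: outside the valid range
]

total = len(records)
missing = sum(1 for r in records if r["temp_c"] is None)
unique = {tuple(sorted(r.items())) for r in records}
out_of_range = sum(1 for r in records
                   if r["temp_c"] is not None and not (-40.0 <= r["temp_c"] <= 60.0))

print(f"completeness: {(total - missing) / total:.0%}")        # 75%
print(f"redundancy:   {total - len(unique)} duplicate(s)")     # 1 duplicate
print(f"consistency:  {out_of_range} out-of-range value(s)")   # 1
```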

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, scalable distributed relational database F1. In the future, the storage technology for big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross-fusion of sciences: big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross-fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security. Then, the impacts of big data on production management, business operation, and decision making shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields requires the participation of interdisciplinary talents.

– Visualization: in many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making, but analytical results can only be effectively utilized by users when they are displayed in a friendly way. Reports, histograms, pie charts, and regression curves are frequently used to visualize the results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: it is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers a revolution in thinking: gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we are willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– Simple algorithms on big data are more effective than complex algorithms on small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and will of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data New York Times pp 11

6 Yuki N (2011) Following digital breadcrumbs to big data gold httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) GFS: evolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide O'Reilly Media Inc

91 Judd D (2008) hypertable-0.9.0.4-alpha

92 Chodorow K (2013) MongoDB: the definitive guide O'Reilly Media Inc

93 Crockford D (2006) The application/json media type for javascript object notation (json)

94 Murty J (2009) Programming amazon web services: S3 EC2 SQS FPS and SimpleDB O'Reilly Media Inc

95 Anderson JC Lehnardt J Slater N (2010) CouchDB: the definitive guide O'Reilly Media Inc

96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in mapreduce In Proceedings of the 2010 ACM SIGMOD international conference on management of data ACM pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC: Special Report on Personal Technology (2011)

116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology Front Plant Sci 34(2)1–16 doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116

208 Mobile Netw Appl (2014) 19171ndash209

141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences

Mobile Netw Appl (2014) 19171ndash209 209

  • Big Data A Survey
    • Abstract
    • Background
      • Dawn of big data era
      • Definition and features of big data
      • Big data value
      • The development of big data
      • Challenges of big data
        • Related technologies
          • Relationship between cloud computing and big data
          • Relationship between IoT and big data
          • Data center
          • Relationship between hadoop and big data
            • Big data generation and acquisition
              • Data generation
                • Enterprise data
                • IoT data
                • Bio-medical data
                • Data generation from other fields
                  • Big data acquisition
                    • Data collection
                    • Data transportation
                    • Data pre-processing
                        • Big data storage
                          • Storage system for massive data
                          • Distributed storage system
                          • Storage mechanism for big data
                            • Database technology
                              • Traditional data analysis
                              • Big data analytic methods
                              • Architecture for big data analysis
                                • Real-time vs offline analysis
                                • Analysis at different levels
                                • Analysis with different complexity
                                  • Tools for big data mining and analysis
                                    • Big data applications
                                      • Key applications of big data
                                        • Application evolutions
                                        • Structured data analysis
                                        • Text data analysis
                                        • Web data analysis
                                        • Multimedia data analysis
                                        • Network data analysis
                                        • Mobile data analysis
                                          • Key applications of big data
                                            • Application of big data in enterprises
                                            • Application of IoT based big data
                                            • Application of online social network-oriented big data
                                            • Applications of healthcare and medical big data
                                            • Collective intelligence
                                            • Smart grid
                                                • Conclusion open issues and outlook
                                                  • Open issues
                                                    • Theoretical research
                                                    • Technology development
                                                    • Practical implications
                                                    • Data security
                                                      • Outlook
                                                        • Acknowledgments
                                                        • References
Page 32: Big Data: A Survey Min Chen

according to peak and low periods of power consumption. TXU Energy used such pricing levels to smooth the peaks and troughs of power consumption. In fact, the application of big data in the smart grid enables time-sharing dynamic pricing, which is a win-win for both energy suppliers and users.

– Access of intermittent renewable energy. At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the generation capacity of these new energy resources is closely tied to climate conditions, which are random and intermittent, connecting them to power grids is challenging. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity they generate can be allocated to regions with electricity shortages, so that these resources complement traditional hydropower and thermal power generation.

7 Conclusion, open issues and outlook

In this paper, we review the background and state-of-the-art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, online social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, as discussed below.

– Fundamental problems of big data. There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research, because big data is not formally and structurally defined and the existing models have not been strictly verified.

– Standardization of big data. An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many big data solutions claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark that balances the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after a system is implemented and deployed, which makes it hard to horizontally compare the advantages and disadvantages of alternative solutions before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, effective and rigorous evaluation of data quality is also an urgent problem.

– Evolution of big data computing modes. These include the memory mode, data flow mode, PRAM mode, and MR (MapReduce) mode, among others. The emergence of big data drives advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon; a minimal sketch of the MR mode is given below.
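To make the MR mode mentioned above concrete, the following is a minimal sketch, not taken from any of the surveyed systems, that emulates the map, shuffle, and reduce phases of a word-count job in plain Python. In a real deployment the same logic would run on a framework such as Hadoop or Spark over partitioned data.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each mapper turns its input split into (key, value) pairs.
def map_phase(split):
    return [(word, 1) for line in split for word in line.split()]

# Shuffle phase: group intermediate pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each reducer aggregates all values of one key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

# Toy "distributed" input: two splits that would live on different nodes.
splits = [
    ["big data value chain", "data generation and data acquisition"],
    ["data storage", "data analysis of big data"],
]

intermediate = chain.from_iterable(map_phase(s) for s in splits)
print(reduce_phase(shuffle(intermediate)))  # e.g. {'data': 6, 'big': 2, ...}
```

The point of the sketch is the data-intensive style itself: the computation is expressed as transformations over records, so the framework, not the programmer, decides where the data lives and where the code runs.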

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data. Due to wide and diverse data sources, heterogeneity is an inherent characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value (a small normalization sketch is given after this list).


– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck of big data computing. However, data transfer is inevitable in big data applications, so improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data. As big data research advances, new problems of big data processing arise beyond traditional data analysis, including (i) data re-utilization: as data scale increases, more value may be mined by re-utilizing existing data; (ii) data re-organization: datasets from different businesses can be re-organized so that more value can be mined; (iii) data exhaust, which refers to wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
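To illustrate the format-conversion issue raised in the first item above, the following is a small sketch, of our own and not from the surveyed systems, that normalizes heterogeneous records, one arriving as CSV and one as JSON, into a single common representation. Field names such as `device_id` and `reading` are hypothetical.

```python
import csv
import io
import json

# A common target schema for heterogeneous records (field names are assumptions).
def normalize(device_id, timestamp, reading):
    return {"device_id": str(device_id),
            "timestamp": str(timestamp),
            "reading": float(reading)}

def from_csv(text):
    # Source A delivers CSV with a header row.
    return [normalize(row["device_id"], row["timestamp"], row["reading"])
            for row in csv.DictReader(io.StringIO(text))]

def from_json(text):
    # Source B delivers JSON objects with differently named keys.
    return [normalize(obj["id"], obj["ts"], obj["value"]) for obj in json.loads(text)]

csv_source = "device_id,timestamp,reading\ns1,2014-01-01T00:00,21.5\n"
json_source = '[{"id": "s2", "ts": "2014-01-01T00:00", "value": 19.0}]'

unified = from_csv(csv_source) + from_json(json_source)
print(json.dumps(unified, indent=2))  # one consistent representation downstream
```

The costly part in practice is not the conversion of a single record but doing this at scale and with many more source formats; the sketch only shows why a common schema makes downstream processing simpler.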

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management. The emergence of big data brings new challenges to traditional data management. At present, many research efforts are devoted to big data oriented database and Internet technologies, storage models and databases suitable for new hardware, integration of heterogeneous and multi-structured data, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

– Integration and provenance of big data. As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. It is therefore worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data applications. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunications, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows rapidly, safety risks become more severe, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy. Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties of users may be acquired more easily, possibly without users being aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if it was acquired with the permission of users. For example, Facebook is currently regarded as the big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization: low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism. Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed (a chunked-encryption sketch is given after this list). Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security. Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an intrusion detection system. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be identified more easily through the analysis of big data.
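As a rough illustration of the safety-mechanism point above, the sketch below, our own and assuming the third-party `cryptography` Python package is available, encrypts a large byte stream chunk by chunk with AES-GCM so that chunks can be processed independently and in parallel; the chunk size and key handling are deliberately simplified.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB chunks; a tunable assumption

def encrypt_chunks(data: bytes, key: bytes):
    """Encrypt data chunk by chunk; each chunk gets its own random nonce."""
    aesgcm = AESGCM(key)
    encrypted = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        nonce = os.urandom(12)                       # 96-bit nonce per chunk
        encrypted.append((nonce, aesgcm.encrypt(nonce, chunk, None)))
    return encrypted

def decrypt_chunks(encrypted, key: bytes) -> bytes:
    aesgcm = AESGCM(key)
    return b"".join(aesgcm.decrypt(nonce, ct, None) for nonce, ct in encrypted)

key = AESGCM.generate_key(bit_length=256)
blob = os.urandom(10 * 1024 * 1024)                  # stand-in for a large dataset
assert decrypt_chunks(encrypt_chunks(blob, key), key) == blob
```

In a real system the chunk index would also be authenticated, for instance passed as associated data, to prevent reordering; the point here is only that chunked encryption keeps per-operation cost bounded as data volume grows.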

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and the hardware and software system architectures for big data processing. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, with technology driving the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for possible future events.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, they are expected to fall behind and be replaced given the rapid development of big data; the theoretical basis of Hadoop emerged as early as 2006. Many researchers have been exploring better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Spanner, Google's globally-distributed database, and F1, a fault-tolerant, scalable distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross-fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross-fusion of many disciplines. The development of big data shall explore innovative technologies and methods for data acquisition, storage, processing, analysis, and information security. Then, the impacts of big data on production management, business operation, and decision making shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making, but analytical results can only be effectively utilized by users when they are displayed in a friendly way. Reports, histograms, pie charts, and regression curves are frequently used to visualize the results of data analysis (a minimal plotting sketch is given after this list). New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, uses relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers a revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– Simple algorithms applied to big data are more effective than complex algorithms applied to small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".
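Referring back to the visualization item above, the following is a minimal sketch, with synthetic data and assuming NumPy and matplotlib are available, of the histogram and regression-curve presentations mentioned there; the variables and data are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 15, 1000)            # synthetic metric, e.g. per-user traffic
y = 2.0 * x + rng.normal(0, 10, 1000)   # a correlated metric

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(x, bins=30)                     # histogram view of the distribution
ax1.set_title("Histogram of x")

slope, intercept = np.polyfit(x, y, 1)   # simple fitted regression curve
order = np.argsort(x)
ax2.scatter(x, y, s=4)
ax2.plot(x[order], slope * x[order] + intercept, linewidth=2)
ax2.set_title("Regression of y on x")

plt.tight_layout()
plt.show()
```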

Throughout the history of human society, the demands and will of human beings have always been the source of power promoting scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for it. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1 Gantz J Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5 Lohr S (2012) The age of big data. New York Times, pp 11
6 Yuki N (2011) Following digital breadcrumbs to big data gold. httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data
7 Yuki N (2011) The search for analysts to make sense of big data. httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

11 Mayer-Schonberger V Cukier K (2013) Big data: a revolution that will transform how we live, work and think. Eamon Dolan/Houghton Mifflin Harcourt

12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98


20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

54 Cisco data center interconnect design and deployment guide(2010)

55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39


60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase: the definitive guide. O'Reilly Media Inc
91 Judd D (2008) Hypertable-0.9.0.4-alpha
92 Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc
93 Crockford D (2006) The application/json media type for JavaScript object notation (JSON)
94 Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc
95 Anderson JC Lehnardt J Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc
96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425


100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC. Special report on personal technology (2011)
116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler D Matasci N Wang L Hanlon M Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116


141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences

Mobile Netw Appl (2014) 19171ndash209 209

  • Big Data A Survey
    • Abstract
    • Background
      • Dawn of big data era
      • Definition and features of big data
      • Big data value
      • The development of big data
      • Challenges of big data
        • Related technologies
          • Relationship between cloud computing and big data
          • Relationship between IoT and big data
          • Data center
          • Relationship between hadoop and big data
            • Big data generation and acquisition
              • Data generation
                • Enterprise data
                • IoT data
                • Bio-medical data
                • Data generation from other fields
                  • Big data acquisition
                    • Data collection
                    • Data transportation
                    • Data pre-processing
                        • Big data storage
                          • Storage system for massive data
                          • Distributed storage system
                          • Storage mechanism for big data
                            • Database technology
                              • Traditional data analysis
                              • Big data analytic methods
                              • Architecture for big data analysis
                                • Real-time vs offline analysis
                                • Analysis at different levels
                                • Analysis with different complexity
                                  • Tools for big data mining and analysis
                                    • Big data applications
                                      • Key applications of big data
                                        • Application evolutions
                                        • Structured data analysis
                                        • Text data analysis
                                        • Web data analysis
                                        • Multimedia data analysis
                                        • Network data analysis
                                        • Mobile data analysis
                                          • Key applications of big data
                                            • Application of big data in enterprises
                                            • Application of IoT based big data
                                            • Application of online social network-oriented big data
                                            • Applications of healthcare and medical big data
                                            • Collective intelligence
                                            • Smart grid
                                                • Conclusion open issues and outlook
                                                  • Open issues
                                                    • Theoretical research
                                                    • Technology development
                                                    • Practical implications
                                                    • Data security
                                                      • Outlook
                                                        • Acknowledgments
                                                        • References
Page 33: Big Data: A Survey Min Chen

ndash Big data transfer Big data transfer involves big datageneration acquisition transmission storage and otherdata transformations in the spatial domain As dis-cussed big data transfer usually incurs high costswhich is the bottleneck for big data computing How-ever data transfer is inevitable in big data applicationsImproving the transfer efficiency of big data is a keyfactor to improve big data computing

ndash Real-time performance of big data The real-time per-formance of big data is also a key problem in manyapplication scenarios Effective means to define the lifecycle of data compute the rate of depreciation of dataand build computing models of real-time and onlineapplications will influence the analysis results of bigdata

ndash Processing of big data As big data research isadvanced new problems on big data processing arisefrom the traditional data analysis including (i) datare-utilization with the increase of data scale morevalues may be mined from re-utilization of existingdata (ii) data re-organization datasets in different busi-nesses can be re-organized which can be mined morevalue (iii) data exhaust which means wrong data dur-ing acquisition In big data not only the correct data butalso the wrong data should be utilized to generate morevalue

713 Practical implications

Although there are already many successful big data appli-cations many practical problems should be solved

ndash Big data management The emergence of big databrings about new challenges to traditional data man-agement At present many research efforts are beingmade on big data oriented database and Internet tech-nologies storage models and databases suitable fornew hardware heterogeneous and multi-structured dataintegration data management of mobile and pervasivecomputing data management of SNS and distributeddata management

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, among others.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as differing data patterns and large amounts of redundant data. Data provenance describes the generation and evolution of data over time, and is mainly used to investigate multiple datasets rather than a single dataset. It is therefore worth studying how to integrate provenance information that follows different standards and originates from different datasets (a small provenance-recording sketch follows this list).

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction with big data are therefore all important research problems.
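
As a concrete illustration of the provenance point above, the following is a minimal sketch of recording lineage information while integrating datasets. The record structure, dataset names, and fields are assumptions made for illustration only; real provenance models (e.g., the W3C PROV family) are considerably richer.

```python
# Minimal sketch of recording data provenance during integration:
# each derived dataset is stored together with the identities of its
# inputs and the operation that produced it. All names are invented.
import hashlib
import json
import time

def fingerprint(records):
    """Content hash used here to identify a dataset version."""
    blob = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]

provenance_log = []

def integrate(output_name, sources, operation, result_records):
    """Return the integration result and log where it came from."""
    provenance_log.append({
        "output": output_name,
        "output_id": fingerprint(result_records),
        "inputs": [{"name": name, "id": fingerprint(recs)}
                   for name, recs in sources],
        "operation": operation,
        "timestamp": time.time(),
    })
    return result_records

# Two hypothetical source datasets with different schemas.
clinic_a = [{"patient": "p1", "hemoglobin": 13.2}]
clinic_b = [{"patient": "p1", "blood_pressure": "120/80"}]

merged = integrate(
    "merged_patient_records",
    [("clinic_a", clinic_a), ("clinic_b", clinic_b)],
    "merge on patient id",
    [{"patient": "p1", "hemoglobin": 13.2, "blood_pressure": "120/80"}],
)

print(json.dumps(provenance_log, indent=2))
```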

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volumes grow rapidly, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges:

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is currently deemed the big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to harvest data from the public pages of Facebook users who had failed to modify their privacy settings. He packaged the data into a 2.8 GB archive and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may enable privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization; low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence it. Data quality is mainly manifested in accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality problems and repair some damaged data need to be investigated.

– Big data safety mechanisms: Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed (a small sketch of chunk-wise encryption appears at the end of this subsection). Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data applications in information security: Big data not only brings challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files from an Intrusion Detection System (a minimal log-scanning sketch follows this list). In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be identified more easily through the analysis of big data.
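
The following is a minimal sketch of the log-analysis idea in the last item: counting failed-login attempts per source address in (invented) intrusion-detection or system logs and flagging unusually active sources. The log format and alert threshold are assumptions; a real deployment would process far larger, distributed logs.

```python
# Minimal sketch: mine security signals from log data by counting
# failed-login attempts per source address. Log lines, the regular
# expression, and the threshold are assumptions for illustration.
import re
from collections import Counter

LOG_LINES = [
    "2014-03-01 10:00:01 sshd: Failed password for root from 10.0.0.5",
    "2014-03-01 10:00:02 sshd: Failed password for root from 10.0.0.5",
    "2014-03-01 10:00:03 sshd: Accepted password for alice from 10.0.0.8",
    "2014-03-01 10:00:04 sshd: Failed password for admin from 10.0.0.5",
]

FAILED = re.compile(r"Failed password .* from (\d+\.\d+\.\d+\.\d+)")
THRESHOLD = 3  # assumed alert threshold

failures = Counter()
for line in LOG_LINES:
    match = FAILED.search(line)
    if match:
        failures[match.group(1)] += 1

for address, count in failures.items():
    if count >= THRESHOLD:
        print(f"possible brute-force source: {address} ({count} failures)")
```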

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
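
To illustrate the encryption aspect of the safety-mechanism item above, the following is a minimal sketch of symmetric encryption applied chunk by chunk, so that a large file never needs to fit in memory at once. It assumes the third-party Python "cryptography" package; key management, access control, and multi-tenant isolation, which the item also calls for, are outside the scope of this sketch.

```python
# Minimal sketch of chunk-wise symmetric encryption for large files.
# Assumes the third-party "cryptography" package. File names are
# invented; in practice the key would come from a key-management system.
from cryptography.fernet import Fernet

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MB per chunk

def encrypt_file(key, src_path, dst_path):
    f = Fernet(key)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            # Fernet tokens are base64 text, so one token per line is safe.
            dst.write(f.encrypt(chunk) + b"\n")

def decrypt_file(key, src_path, dst_path):
    f = Fernet(key)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for token in src:
            dst.write(f.decrypt(token.strip()))

if __name__ == "__main__":
    key = Fernet.generate_key()
    encrypt_file(key, "measurements.csv", "measurements.enc")
    decrypt_file(key, "measurements.enc", "measurements.decrypted.csv")
```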

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, a change that is happening right now. We cannot predict the future, but we may take precautions for possible events to come:

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data; the theoretical basis of Hadoop emerged as early as 2006. Many researchers have been exploring better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create further value. From now on, enterprises that master big data resources may obtain huge benefits by renting out or assigning the rights to use their data.

– Big data promotes the cross-fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross-fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talent.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making, but only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis (a minimal plotting sketch follows this list of trends). New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have a far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers a revolution in thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

  – Compared with accurate data, we are willing to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – The simple algorithms of big data are more effective than the complex algorithms of small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".
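
Returning to the visualization trend above, the following is a minimal, hypothetical plotting sketch of two of the presentation forms mentioned there, a histogram and a fitted regression curve. It assumes the numpy and matplotlib libraries and uses synthetic data; real big data visualization would normally aggregate the data first.

```python
# Minimal plotting sketch: a histogram and a fitted regression curve
# over synthetic data. numpy and matplotlib are assumed to be available.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.5 * x + 4 + rng.normal(scale=3.0, size=x.size)  # noisy linear signal

fig, (left, right) = plt.subplots(1, 2, figsize=(9, 3.5))

# Histogram of the observed values
left.hist(y, bins=20)
left.set_title("Distribution of y")

# Scatter plot with a fitted regression line
slope, intercept = np.polyfit(x, y, 1)
right.scatter(x, y, s=8)
right.plot(x, slope * x + intercept, linewidth=2)
right.set_title(f"Regression: y = {slope:.2f}x + {intercept:.2f}")

fig.tight_layout()
plt.show()
```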

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for it. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, the analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12
2. Fact sheet: Big data across the federal government (2012). httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf
3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper
4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). httpwwweconomistcomblogsdailychart201111bigdata-0
5. Lohr S (2012) The age of big data. New York Times, pp 11
6. Yuki N (2011) Following digital breadcrumbs to big data gold. httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data
7. Yuki N (2011) The search for analysts to make sense of big data. httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data
8. Big data (2008). httpwwwnaturecomnewsspecialsbigdataindexhtml
9. Special online collection: dealing with big data (2011). httpwwwsciencemagorgsitespecialdata
10. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute
11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work and think. Eamon Dolan/Houghton Mifflin Harcourt
12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note 6, February
13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media
14. Meijer E (2011) The world according to LINQ. Communications of the ACM 54(10):45–51
15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. httpwwwgartnercomitpagejsp
16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media
17. Grobelnik M (2012) Big data tutorial. httpvideolecturesneteswc2012grobelnikbigdata
18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014
19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present and future. UCI ISG lecture series on scalable data management
21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43
22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery
24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81
25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27
26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033
27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98
28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States
29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM
30. Wiki (2013) Applications and organizations using Hadoop. httpwikiapacheorghadoopPoweredBy
31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843
32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354
33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16
34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33
35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48, 2008
36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68
37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180
38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729
39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) LUSTER: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems. ACM, pp 103–116
40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 (IPSN'08). IEEE, pp 332–343
41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems. ACM, pp 309–322
42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007, 6th international symposium on (IPSN 2007). IEEE, pp 254–263
43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks. IEEE Computer Society, pp 277–288
44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63
45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687
46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135
47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160
48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84
49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142
50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on (ECOC'09). IEEE, pp 1–4
51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108
52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204
53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577
54. Cisco data center interconnect design and deployment guide (2010)
55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62
56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74
57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350
58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62
59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338
61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24
62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8
63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383
64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395
65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454
66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246
67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101
68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209
69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113
70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62
71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871
72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008, IEEE 24th international conference on (ICDE 2008). IEEE, pp 1480–1482
73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82
74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682
75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278
76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398
77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440
78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1021–1032
79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1
80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7
81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59
82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10
83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276
84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8
85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220
86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663
87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4
88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation. USENIX Association, pp 335–350
89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5
90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc
91. Judd D (2008) hypertable-0.9.0.4-alpha
92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc
93. Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)
94. Murty J (2009) Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc
95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc
96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986
97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322
98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298
99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proc VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629
101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72
102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14
103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-Pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on (IPDPS 2008). IEEE, pp 1–11
104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146
105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296
106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818
107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2
108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7
109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9
110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York
111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml
113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer
114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT
115. Beyond the PC. Special report on personal technology (2011)
116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034
117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77
118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286
119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26
120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57
121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7
122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press
123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177
124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11
125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117
126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the World-Wide Web. In: VLDB, vol 95, pp 54–65
127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640
128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2
129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25
130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19
131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819
132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939
133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311
134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91
135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478
136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569
137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company
138. Aggarwal CC (2011) An introduction to social network data analytics. Springer
139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054
140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10
142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web. ACM, pp 631–640
143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25
144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321
145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158
146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144
147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016
148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816
149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666
150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360
151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (accessed 5 May 2013)
152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51
153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2
154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478
155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454
156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase the definitive guide OrsquoReilly Media Inc91 Judd D (2008) hypertable-09 04-alpha92 Chodorow K (2013) MongoDB the definitive guide OrsquoReilly

Media Inc93 Crockford D (2006) The applicationjson media type for

javascript object notation (json)94 Murty J (2009) Programming amazon web services S3 EC2

SQS FPS and SimpleDB OrsquoReilly Media Inc95 Anderson JC Lehnardt J Slater N (2010) CouchDB the defini-

tive guide OrsquoReilly Media Inc96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y

(2010) A comparison of join algorithms for log processing inmapreduce In Proceedings of the 2010 ACM SIGMOD inter-national conference on management of data ACM pp 975ndash986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425

Mobile Netw Appl (2014) 19171ndash209 207

100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116

208 Mobile Netw Appl (2014) 19171ndash209

141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences

Mobile Netw Appl (2014) 19171ndash209 209

  • Big Data A Survey
    • Abstract
    • Background
      • Dawn of big data era
      • Definition and features of big data
      • Big data value
      • The development of big data
      • Challenges of big data
        • Related technologies
          • Relationship between cloud computing and big data
          • Relationship between IoT and big data
          • Data center
          • Relationship between hadoop and big data
            • Big data generation and acquisition
              • Data generation
                • Enterprise data
                • IoT data
                • Bio-medical data
                • Data generation from other fields
                  • Big data acquisition
                    • Data collection
                    • Data transportation
                    • Data pre-processing
                        • Big data storage
                          • Storage system for massive data
                          • Distributed storage system
                          • Storage mechanism for big data
                            • Database technology
                              • Traditional data analysis
                              • Big data analytic methods
                              • Architecture for big data analysis
                                • Real-time vs offline analysis
                                • Analysis at different levels
                                • Analysis with different complexity
                                  • Tools for big data mining and analysis
                                    • Big data applications
                                      • Key applications of big data
                                        • Application evolutions
                                        • Structured data analysis
                                        • Text data analysis
                                        • Web data analysis
                                        • Multimedia data analysis
                                        • Network data analysis
                                        • Mobile data analysis
                                          • Key applications of big data
                                            • Application of big data in enterprises
                                            • Application of IoT based big data
                                            • Application of online social network-oriented big data
                                            • Applications of healthcare and medical big data
                                            • Collective intelligence
                                            • Smart grid
                                                • Conclusion open issues and outlook
                                                  • Open issues
                                                    • Theoretical research
                                                    • Technology development
                                                    • Practical implications
                                                    • Data security
                                                      • Outlook
                                                        • Acknowledgments
                                                        • References
Page 35: Big Data: A Survey Min Chen

utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, when logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually shifting from being algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, with far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers a revolution in thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than only analyzing a small set of samples.

  – Compared with accurate data, we are willing to accept abundant and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – Simple algorithms applied to big data are more effective than complex algorithms applied to small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and aspirations of human beings have always been the driving force of scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread use of big data; big data is better seen as an extendable and expandable human brain than as a substitute for it. With the emergence of IoT, the development of mobile sensing technology, and progress in data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities will attract increasing attention and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12
2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf
3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper
4. Drowning in numbers – digital data will flood the planet – and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0
5. Lohr S (2012) The age of big data. New York Times, p 11
6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data
7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data
8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html
9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data
10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute
11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt
12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note, 6 February
13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media
14. Meijer E (2011) The world according to LINQ. Communications of the ACM 54(10):45–51
15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp
16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media
17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data
18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014
19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present and future. UCI ISG lecture series on scalable data management
21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37, ACM, pp 29–43
22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery
24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81
25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27
26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033
27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98
28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States
29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of the ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13), ACM
30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy
31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Trans Parallel Distrib Syst 23(10):1831–1843
32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Pract Experience 23(17):2338–2354
33. Gantz J, Reinsel D (2010) The digital universe decade – are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16
34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33
35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48, 2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management, ACM, pp 63–68
37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180
38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks, IEEE, pp 728–729
39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) LUSTER: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems, ACM, pp 103–116
40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 international conference on (IPSN'08), IEEE, pp 332–343
41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems, ACM, pp 309–322
42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007, 6th international symposium on (IPSN 2007), IEEE, pp 254–263
43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks, IEEE Computer Society, pp 277–288
44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems, ACM, pp 51–63
45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687
46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web, ACM, pp 124–135
47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160
48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84
49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142
50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on (ECOC'09), IEEE, pp 1–4
51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synth Lect Comput Archit 4(1):1–108
52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204
53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577
54. Cisco data center interconnect design and deployment guide (2010)
55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39, ACM, pp 51–62
56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74
57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350
58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62
59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40, ACM, pp 327–338
61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems, ACM, p 24
62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks, ACM, p 8
63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383
64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395
65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454
66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, ACM, pp 233–246
67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101
68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ, Citeseer, pp 200–209
69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113
70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 51–62
71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management, ACM, pp 862–871
72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008, IEEE 24th international conference on (ICDE 2008), IEEE, pp 1480–1482
73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82
74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Trans Multimed 14(3):669–682
75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 269–278
76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Trans Comput Biol Bioinforma (TCBB) 9(5):1387–1398
77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Trans Comput Biol Bioinforma 8(2):428–440
78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, pp 1021–1032
79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1
80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7
81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59
82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10
83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276
84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8
85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220
86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing, ACM, pp 654–663
87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4
88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation, USENIX Association, pp 335–350
89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing, ACM, pp 5–5
90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc
91. Judd D (2008) hypertable-0.9.0.4-alpha
92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc
93. Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)
94. Murty J (2009) Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc
95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc
96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986
97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications, Springer, pp 308–322
98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298
99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proc VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629
101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72
102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14
103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-Pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on (IPDPS 2008), IEEE, pp 1–11
104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 135–146
105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296
106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing, ACM, pp 810–818
107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, USENIX Association, pp 2–2
108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, ACM, p 7
109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9
110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York
111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html
113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer
114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT
115. Beyond the PC. Special report on personal technology (2011)
116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034
117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance, ACM, pp 70–77
118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers, ACM, pp 277–286
119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26
120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57
121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inform Syst (TMIS) 3(2):7
122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press
123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Trans Neural Netw 13(5):1163–1177
124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11
125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117
126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the World-Wide Web. In: VLDB, vol 95, pp 54–65
127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640
128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval, ACM, p 2
129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25
130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19
131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819
132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939
133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311
134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91
135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia, ACM, pp 469–478
136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569
137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company
138. Aggarwal CC (2011) An introduction to social network data analytics. Springer
139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1046–1054
140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10
142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web, ACM, pp 631–640
143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis, ACM, pp 16–25
144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ACM, pp 315–321
145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 145–158
146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 131–144
147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1007–1016
148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 807–816
149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining, ACM, pp 657–666
150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360
151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)
152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51
153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, p 2
154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval, ACM, pp 469–478
155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium, ACM, pp 445–454
156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences

Mobile Netw Appl (2014) 19171ndash209 209

  • Big Data A Survey
    • Abstract
    • Background
      • Dawn of big data era
      • Definition and features of big data
      • Big data value
      • The development of big data
      • Challenges of big data
        • Related technologies
          • Relationship between cloud computing and big data
          • Relationship between IoT and big data
          • Data center
          • Relationship between hadoop and big data
            • Big data generation and acquisition
              • Data generation
                • Enterprise data
                • IoT data
                • Bio-medical data
                • Data generation from other fields
                  • Big data acquisition
                    • Data collection
                    • Data transportation
                    • Data pre-processing
                        • Big data storage
                          • Storage system for massive data
                          • Distributed storage system
                          • Storage mechanism for big data
                            • Database technology
                              • Traditional data analysis
                              • Big data analytic methods
                              • Architecture for big data analysis
                                • Real-time vs offline analysis
                                • Analysis at different levels
                                • Analysis with different complexity
                                  • Tools for big data mining and analysis
                                    • Big data applications
                                      • Key applications of big data
                                        • Application evolutions
                                        • Structured data analysis
                                        • Text data analysis
                                        • Web data analysis
                                        • Multimedia data analysis
                                        • Network data analysis
                                        • Mobile data analysis
                                          • Key applications of big data
                                            • Application of big data in enterprises
                                            • Application of IoT based big data
                                            • Application of online social network-oriented big data
                                            • Applications of healthcare and medical big data
                                            • Collective intelligence
                                            • Smart grid
                                                • Conclusion open issues and outlook
                                                  • Open issues
                                                    • Theoretical research
                                                    • Technology development
                                                    • Practical implications
                                                    • Data security
                                                      • Outlook
                                                        • Acknowledgments
                                                        • References
Page 36: Big Data: A Survey Min Chen

20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13), ACM

30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Pract Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management, ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings of LCN 2002, 27th annual IEEE conference on local computer networks, IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) LUSTER: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems, ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 international conference on (IPSN'08), IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems, ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007, 6th international symposium on (IPSN 2007), IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks, IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems, ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web, ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on (ECOC'09), IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synth Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39, ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM computer communication review, vol 40, ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems, ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks, ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ, Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management, ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008, IEEE 24th international conference on (ICDE 2008), IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multimed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing, ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation, USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing, ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media, Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media, Inc

93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media, Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media, Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications, Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proc VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on (IPDPS 2008), IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, pp 135–146

105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing, ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special report on personal technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance, ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers, ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM SIGMOD Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the World-Wide Web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval, ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia, ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web, ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis, ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference, ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining, ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval, ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium, ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences

60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

90 George L (2011) HBase the definitive guide OrsquoReilly Media Inc91 Judd D (2008) hypertable-09 04-alpha92 Chodorow K (2013) MongoDB the definitive guide OrsquoReilly

Media Inc93 Crockford D (2006) The applicationjson media type for

javascript object notation (json)94 Murty J (2009) Programming amazon web services S3 EC2

SQS FPS and SimpleDB OrsquoReilly Media Inc95 Anderson JC Lehnardt J Slater N (2010) CouchDB the defini-

tive guide OrsquoReilly Media Inc96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y

(2010) A comparison of join algorithms for log processing inmapreduce In Proceedings of the 2010 ACM SIGMOD inter-national conference on management of data ACM pp 975ndash986

97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425

Mobile Netw Appl (2014) 19171ndash209 207

100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116

208 Mobile Netw Appl (2014) 19171ndash209

141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences

Mobile Netw Appl (2014) 19171ndash209 209

  • Big Data A Survey
    • Abstract
    • Background
      • Dawn of big data era
      • Definition and features of big data
      • Big data value
      • The development of big data
      • Challenges of big data
        • Related technologies
          • Relationship between cloud computing and big data
          • Relationship between IoT and big data
          • Data center
          • Relationship between hadoop and big data
            • Big data generation and acquisition
              • Data generation
                • Enterprise data
                • IoT data
                • Bio-medical data
                • Data generation from other fields
                  • Big data acquisition
                    • Data collection
                    • Data transportation
                    • Data pre-processing
                        • Big data storage
                          • Storage system for massive data
                          • Distributed storage system
                          • Storage mechanism for big data
                            • Database technology
                              • Traditional data analysis
                              • Big data analytic methods
                              • Architecture for big data analysis
                                • Real-time vs offline analysis
                                • Analysis at different levels
                                • Analysis with different complexity
                                  • Tools for big data mining and analysis
                                    • Big data applications
                                      • Key applications of big data
                                        • Application evolutions
                                        • Structured data analysis
                                        • Text data analysis
                                        • Web data analysis
                                        • Multimedia data analysis
                                        • Network data analysis
                                        • Mobile data analysis
                                          • Key applications of big data
                                            • Application of big data in enterprises
                                            • Application of IoT based big data
                                            • Application of online social network-oriented big data
                                            • Applications of healthcare and medical big data
                                            • Collective intelligence
                                            • Smart grid
                                                • Conclusion open issues and outlook
                                                  • Open issues
                                                    • Theoretical research
                                                    • Technology development
                                                    • Practical implications
                                                    • Data security
                                                      • Outlook
                                                        • Acknowledgments
                                                        • References
Page 38: Big Data: A Survey Min Chen

100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

the 7th ACM international conference on computing frontiersACM pp 277ndash286

119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116

208 Mobile Netw Appl (2014) 19171ndash209

141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences

Mobile Netw Appl (2014) 19171ndash209 209

  • Big Data A Survey
    • Abstract
    • Background
      • Dawn of big data era
      • Definition and features of big data
      • Big data value
      • The development of big data
      • Challenges of big data
        • Related technologies
          • Relationship between cloud computing and big data
          • Relationship between IoT and big data
          • Data center
          • Relationship between hadoop and big data
            • Big data generation and acquisition
              • Data generation
                • Enterprise data
                • IoT data
                • Bio-medical data
                • Data generation from other fields
                  • Big data acquisition
                    • Data collection
                    • Data transportation
                    • Data pre-processing
                        • Big data storage
                          • Storage system for massive data
                          • Distributed storage system
                          • Storage mechanism for big data
                            • Database technology
                              • Traditional data analysis
                              • Big data analytic methods
                              • Architecture for big data analysis
                                • Real-time vs offline analysis
                                • Analysis at different levels
                                • Analysis with different complexity
                                  • Tools for big data mining and analysis
                                    • Big data applications
                                      • Key applications of big data
                                        • Application evolutions
                                        • Structured data analysis
                                        • Text data analysis
                                        • Web data analysis
                                        • Multimedia data analysis
                                        • Network data analysis
                                        • Mobile data analysis
                                          • Key applications of big data
                                            • Application of big data in enterprises
                                            • Application of IoT based big data
                                            • Application of online social network-oriented big data
                                            • Applications of healthcare and medical big data
                                            • Collective intelligence
                                            • Smart grid
                                                • Conclusion open issues and outlook
                                                  • Open issues
                                                    • Theoretical research
                                                    • Technology development
                                                    • Practical implications
                                                    • Data security
                                                      • Outlook
                                                        • Acknowledgments
                                                        • References
Page 39: Big Data: A Survey Min Chen

• Big Data: A Survey
  • Abstract
  • Background
    • Dawn of big data era
    • Definition and features of big data
    • Big data value
    • The development of big data
    • Challenges of big data
  • Related technologies
    • Relationship between cloud computing and big data
    • Relationship between IoT and big data
    • Data center
    • Relationship between Hadoop and big data
  • Big data generation and acquisition
    • Data generation
      • Enterprise data
      • IoT data
      • Bio-medical data
      • Data generation from other fields
    • Big data acquisition
      • Data collection
      • Data transportation
      • Data pre-processing
  • Big data storage
    • Storage system for massive data
    • Distributed storage system
    • Storage mechanism for big data
      • Database technology
  • Big data analysis
    • Traditional data analysis
    • Big data analytic methods
    • Architecture for big data analysis
      • Real-time vs. offline analysis
      • Analysis at different levels
      • Analysis with different complexity
    • Tools for big data mining and analysis
  • Big data applications
    • Application evolutions
      • Structured data analysis
      • Text data analysis
      • Web data analysis
      • Multimedia data analysis
      • Network data analysis
      • Mobile data analysis
    • Key applications of big data
      • Application of big data in enterprises
      • Application of IoT-based big data
      • Application of online social network-oriented big data
      • Applications of healthcare and medical big data
      • Collective intelligence
      • Smart grid
  • Conclusion, open issues, and outlook
    • Open issues
      • Theoretical research
      • Technology development
      • Practical implications
      • Data security
    • Outlook
  • Acknowledgments
  • References
