Copyright © IJIFR 2013. Author's Research Area: Big Data
Available online at: http://www.ijifr.com/searchjournal.aspx
www.ijifr.com | [email protected] | ISSN (Online): 2347-1697
INTERNATIONAL JOURNAL OF INFORMATIVE & FUTURISTIC RESEARCH
An Enlightening Online Open Access, Refereed & Indexed Journal of Multidisciplinary Research
Volume 1, Issue 7, March 2014
Big Data: A Detailed Review
Abstract
This paper explains Big Data, its background, and the four phases of the big data value chain: data generation, data acquisition, data storage, and data analysis. It highlights the need for switching to NoSQL/Big Data technologies and describes their characteristics and applications. Different storage mechanisms are then discussed, along with the major points of difference between HBase, FlockDB, Cassandra, SimpleDB, CouchDB, and Neo4j. The paper reviews several published papers, white papers, seminar presentations, and articles in order to build a complete picture of the background, emergence, utility, applications, and comparative techniques in the area of Big Data. Finally, the survey concludes with a discussion of open issues, risks, and future directions for this emerging technology.
Keywords: Big Data, Cloud Computing, Internet of Things, Data Center, Hadoop, NoSQL
1. Introduction
Over the past 20 years, data has increased on a large scale in various fields. According to a report from International Data Corporation (IDC), the overall created and copied data volume in the world in 2011 was 1.8 ZB (approximately 10^21 bytes), nearly nine times the volume of five years earlier [1]. This figure is expected to double at least every two years in the near future. Under this explosive increase of global data, the term Big Data is mainly used to describe enormous datasets. Big data typically includes masses of unstructured data that need more real-time analysis [23]. The sharply increasing data deluge of the big data era brings huge challenges for data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS).
Ms. Subita Kumari, Research Scholar
Computer Science and Engineering Department, UIET, MDU, Rohtak, Haryana, India
PAPER ID: IJIFR/V1/E7/017
However, such RDBMSs apply only to structured data, not to semi-structured or unstructured data. In addition, RDBMSs increasingly rely on more and more expensive hardware. It is apparent that traditional RDBMSs cannot handle the huge volume and heterogeneity of big data. For permanent storage and management of large-scale disordered datasets, distributed file systems and NoSQL (Not Only SQL) databases are good choices.
Nowadays, big data related to the services of Internet companies grows rapidly. For example, Google processes hundreds of petabytes (PB) of data, and Facebook generates over 10 PB of log data per month. Figure 1 illustrates the boom of the global data volume. While the amount of large datasets is rising drastically, it also brings many challenging problems demanding prompt solutions. The key challenges are data representation, redundancy reduction and data compression, data life-cycle management, analytical mechanisms, data confidentiality, energy management, expandability and scalability, and cooperation.
Figure 1: The Continuously Growing Big Data
Big data is an abstract concept. Apart from masses of data, it has some other features that distinguish it from "massive data" or "very big data." In general, big data means datasets that cannot be perceived, acquired, managed, and processed by traditional IT and software/hardware tools within a tolerable time. In 2010, Apache Hadoop defined big data as "datasets which could not be captured, managed, and processed by general computers within an acceptable scope." At present, big data generally ranges from several TB to several PB [1]. In 2011, an IDC report defined big data as follows: "big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis" [23]. With this definition, the characteristics of big data may be summarized as four Vs: Volume (great volume), Variety (various modalities), Velocity (rapid generation), and Value (huge value but very low density), as shown in Figure 2. However, big data has raised many challenges. With the development of Internet services, indexes and queried contents grew rapidly, so search engine companies had to face the challenge of handling such big data. Google created the GFS [26] and MapReduce [27] programming models to cope with the challenges brought about by data management and analysis at Internet scale. In addition, content generated by users, sensors, and other ubiquitous data sources further fueled the overwhelming data flows, which required a fundamental change in computing architecture and large-scale data processing mechanisms.
Figure 2: 4V’s of BigData
2. Literature Review
John Gantz and David Reinsel, in 2011, said that the growth of the digital universe continues to outpace the growth of storage capacity. But keep in mind that a gigabyte of stored content can generate a petabyte or more of transient data that we typically don't store (e.g., digital TV signals we watch but don't record, voice calls that are made digital in the network backbone for the duration of a call). So, like our physical universe, the digital universe is something to behold: 1.8 trillion gigabytes in 500 quadrillion "files," more than doubling every two years. That is nearly as many bits of information in the digital universe as stars in our physical universe [1].
In 2012, Danah boyd and Kate Crawford observed that the era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. The six provocations they offer for Big Data are:
1. Big Data changes the definition of knowledge
2. Claims to objectivity and accuracy are misleading
3. Bigger data are not always better data
4. Taken out of context, Big Data loses its meaning
5. Just because it is accessible does not make it ethical
6. Limited access to Big Data creates new digital divides [2]
A MongoDB white paper states that companies should consider five critical dimensions to make the right choice for their applications and their businesses: data model, query model, consistency model, APIs and commercial support, and community strength [3]. Moniruzzaman, A. B. M. and Hossain, Syed Akhter presented a paper on the classification, characteristics, and evaluation of NoSQL databases in big data analytics. Their report is intended to help users, especially organizations, obtain an independent understanding of the strengths and weaknesses of various NoSQL database approaches to supporting applications that process huge volumes of data [4]. Jongwook Woo, Siddharth Basopia, and Yuhang Xu presented a paper on an HBase schema for processing transaction data with a Market Basket Analysis algorithm. The algorithm runs on Hadoop MapReduce by reading data from both HBase and HDFS. Their experimental results show that the MapReduce code gains performance as more nodes are added, but at a certain point a bottleneck prevents further gains. Moreover, executing the algorithm with data in HBase is slower than in HDFS [5].
In 2012, Vatika Sharma and Meenu Dave presented a paper on the characteristics of NoSQL: it does not use the relational data model and thus does not use the SQL language; it stores large volumes of data; it operates without inconsistency in distributed environments; it is open source; it has no fixed schema; it does not use the ACID properties; and it offers high performance and a more flexible structure [6]. Kailash Raj Joshi presented the pros and cons of both technologies and tried to address issues involving data visualization. Characteristics such as flexibility, low latency, scalability, schemaless design, fast queries, and performance are major advantages of a NoSQL database [7]. A Couchbase NoSQL white paper states, "Today, the use of NoSQL technology is rising rapidly among Internet companies and the enterprise. It's increasingly considered a viable alternative to relational databases" [8].
In an international seminar in 2012, Arto Salminen shared information about the NoSQL databases used by big organizations, e.g., Google (BigTable, LevelDB), LinkedIn (Voldemort), Facebook (Cassandra), Twitter (Hadoop/HBase, FlockDB, Cassandra), Netflix (SimpleDB, Hadoop/HBase, Cassandra), and CERN (CouchDB) [9]. Looking at emerging big data trends, NYTimes.com states that data is not only becoming more available but also more understandable to computers. Most of the big data surge is data in the wild: unruly stuff like words, images, and video on the Web, and streams of sensor data. It is called unstructured data and is not typically grist for traditional databases. "GOOD with numbers? Fascinated by data? The sound you hear is opportunity knocking" [10]. In December 2012, Hsinchun Chen, Roger H. L., and Veda C. said, "Now, in this era of Big Data, even while BI&A 2.0 is still maturing, we find ourselves poised at the brink of BI&A 3.0, with all the attendant uncertainty that new and potentially revolutionary technologies bring" [11]. A white paper by TATA Consultancy Services describes various breeds, choices, and tradeoffs to enable enterprises to make informed decisions to optimize the utilization
of NoSQL databases [12]. Darshana Shimpi and Sangita Chaudhari presented a comparison of current graph databases and observed that Neo4j is the best among them [13].
In a paper submitted in April 2013, Renu Kanwar, Prakriti Trivedi, and Kuldeep Singh claimed that NoSQL is the solution for use cases where ACID is not the major concern; it uses BASE instead, which works on eventual consistency [14]. In 2013, Mirko Kampf and Jan W. Kantelhardt described a computational framework for time-series analysis, in which generic data structures represent different types of time series, e.g., event and inter-event time series, and define reliable interfaces to existing big data [15]. David Navetta states that the potential uses and benefits of Big Data are endless. Unfortunately, Big Data also poses some risk, both to the companies seeking to unlock its potential and to the individuals whose information is now continuously being collected, combined, mined, analyzed, disclosed, and acted upon. His paper explains Big Data and some of the privacy-related legal issues and risks associated with it [16].
Matt Bishop says big data is revolutionizing our view of science and has the potential to do the same for the social sciences and humanities. With the benefits come very serious potential problems, ranging from invasion of personal privacy to enabling spectacular failures of analytics; his article discusses some of them [17]. In 2013, Scott J. Lusher, Ross McGuire, Rene van Schaik, David Nicholson, and Jacob de Vlieg highlighted the benefits of big data in the field of medicine: "The exploitation of so-called 'big data' will enable us to undertake research projects never previously possible but should also stimulate a re-evaluation of all our data practices" [18].
Peter Tseng, Westbrook M. Weaver, Mahdokht Masaeli, Keegan Owsley, and Dino Di Carlo highlighted a study showing that the development of large-scale mutant or fusion libraries and the automation of microscopy, image analysis, and data extraction will be key components as microfluidics meets big data [19]. Nicholas P. Restifo, in his 2013 paper, admits that this "Big Data Revolution" is likely to affect fields as diverse as weather forecasting and crime fighting, and the conduct of immunobiology is certainly no exception [20]. Marton Mestyan, Taha Yasseri, and Janos Kertesz showed that the popularity of a movie can be predicted well before its release using big data, by measuring and analyzing the activity levels of editors and viewers of the corresponding entry for the movie in Wikipedia, the well-known online encyclopedia [21]. In February 2014, Wenliang Huang, Zhen Chen, Wenyu Dong, Hang Li, Bin Cao, and Junwei Cao presented a comparison of HBase and Oracle: "Compared with Oracle database, our HBase shows very consistent performance, and the peak insertion rate reaches approximately 100 000 records per second" [22].
Min Chen, Shiwen Mao, and Yunhao Liu state that traditional relational databases cannot meet the challenges of the categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support for large data volumes; they are becoming the core technology for big data [23]. Raghupathi and Raghupathi stated in their paper that McKinsey estimates big data analytics can enable more than $300 billion in savings per year in U.S. healthcare, two-thirds of that through reductions of approximately 8% in national healthcare expenditures. Clinical operations and R&D are two of the largest areas of potential savings, with $165 billion and $108 billion in waste respectively. McKinsey believes big data could help reduce
waste and inefficiency in three areas: clinical operations, research and development, and public health [24].
3. The development of big data
In the late 1970s, the concept of the "database machine" emerged. In the 1980s, the "shared-nothing" parallel database architecture was proposed, and the Teradata system became the first successful commercial parallel database system. By the late 1990s, the advantages of parallel databases were widely recognized in the database field. Taking IBM as an example, since 2005 IBM has invested USD 16 billion in 30 acquisitions related to big data. At the beginning of 2012, a report titled "Big Data, Big Impact," presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, listed big data computing, social analysis, and stored data analysis among the 48 emerging technologies that deserve the most attention. Fundamental technologies closely related to big data include cloud computing, the Internet of Things (IoT), data centers, and Hadoop.
4. Big data generation and acquisition
The value chain of big data can generally be divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.
4.1. Data generation
Data generation is the first step of big data. Taking Internet data as an example, huge amounts of data are generated in the form of search entries, Internet forum posts, chat records, and microblog messages. Data can be generated from enterprise data, IoT data, bio-medical data, and data from other fields [23].
4.2. Big data acquisition
Big data acquisition includes data collection, data transmission, and data pre-processing. During big data
acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it
to a proper storage management system to support different analytical applications.
4.3. Data collection
Data collection utilizes special techniques to acquire raw data from a specific data generation environment. Common data collection methods are log files, sensing, and methods for acquiring network data. Current network data acquisition technologies mainly include traditional Libpcap-based packet capture, zero-copy packet capture, and specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.
Data transportation: upon the completion of raw data collection, data is transferred to a data storage infrastructure for processing and analysis. Big data is mainly stored in a data center. Data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions. As a strengthening
technology, Zhou et al. in [30] adopt wireless links in the 60GHz frequency band to strengthen wired
links.
4.4. Data pre-processing
Because of the wide variety of data sources, collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. Typical relational data pre-processing techniques are integration, cleaning, and redundancy elimination. Generally, data integration methods are accompanied by flow processing engines and search engines [31]. The authors in [33] discussed data cleaning in e-commerce, where data is generated by crawlers and by regularly re-copying customer and account information. Apart from these pre-processing methods, specific data objects must go through other operations such as feature extraction, which plays an important role in multimedia search and DNA analysis [37-39].
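The cleaning and redundancy-elimination steps above can be sketched in a few lines. This is a minimal illustration over a hypothetical record format (the `id`/`value` field names are assumptions, not from any cited system): malformed records are dropped, a field is normalized, and exact duplicates are removed.

```python
def preprocess(records):
    """Drop malformed records, normalize fields, and remove exact duplicates."""
    cleaned = []
    seen = set()
    for rec in records:
        # Cleaning: skip records missing required fields.
        if not rec.get("id") or rec.get("value") is None:
            continue
        # Integration: normalize a field that differs across sources.
        normalized = {"id": str(rec["id"]).strip(), "value": rec["value"]}
        # Redundancy elimination: keep only the first copy of each record.
        key = (normalized["id"], normalized["value"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    return cleaned

raw = [
    {"id": "a1", "value": 10},
    {"id": "a1", "value": 10},   # duplicate
    {"id": None, "value": 5},    # malformed
    {"id": " b2 ", "value": 7},  # needs normalization
]
print(preprocess(raw))  # [{'id': 'a1', 'value': 10}, {'id': 'b2', 'value': 7}]
```

Real pipelines replace the in-memory `seen` set with approximate structures (e.g., Bloom filters) once the data no longer fits on one machine.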
5. Big data storage
5.1. Storage mechanism for big data
Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems,
(ii) databases, and (iii) programming models.
5.2. File Systems
Google's GFS is an expandable distributed file system supporting large-scale, distributed, data-intensive applications [26]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. However, GFS also has limitations, such as a single point of failure and poor performance for small files. These limitations have been overcome by Colossus, the successor of GFS. Other examples are HDFS, Kosmosfs, Cosmos, and Haystack: Microsoft developed Cosmos [44] to support its search and advertisement business, and Facebook utilizes Haystack [45] to store large numbers of small photos.
5.3. Database technology
Relational databases cannot meet the challenges of the categories and scales brought about by big data. Eric Brewer proposed the CAP theorem [41,42] in 2000, which states that a distributed system cannot simultaneously satisfy consistency, availability, and partition tolerance; at most two of the three requirements can be met at the same time. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support for large data volumes. The three main classes of NoSQL databases are key-value databases, column-oriented databases, and document-oriented databases.
5.4 Key-value Databases: Key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and clients query values according to their keys. Over the past few years, many key-value databases have appeared, motivated by
Amazon's Dynamo system [46,47]. Other examples are Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris.
5.5 Column-oriented Databases: Column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to achieve expandability. Column-oriented databases are mainly inspired by Google's BigTable, which is built on many fundamental Google components, including GFS [26], the cluster management system, the SSTable file format, and Chubby [49]. HBase is a BigTable clone programmed in Java and is part of Apache's Hadoop MapReduce framework; it replaces GFS with HDFS. Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commodity servers [50].
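The BigTable/HBase logical layout can be sketched as follows. This is only an illustration of the addressing scheme, not real HBase code: each value is addressed by (row key, column family, column qualifier), and because rows are kept sorted by key, contiguous key ranges can be split across nodes and scanned efficiently.

```python
from collections import defaultdict

class ColumnStore:
    """A sketch of a BigTable-style table: row -> family -> qualifier -> value."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed at creation
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][family][qualifier] = value

    def get(self, row, family, qualifier):
        return self.rows[row][family].get(qualifier)

    def scan(self, start, stop):
        # Range scan over sorted row keys, as in an HBase table scan.
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

table = ColumnStore(families=["info", "stats"])
table.put("row1", "info", "name", "alice")
table.put("row2", "info", "name", "bob")
table.put("row1", "stats", "visits", 3)
print(table.get("row1", "info", "name"))           # alice
print([k for k, _ in table.scan("row1", "row3")])  # ['row1', 'row2']
```

Keeping whole column families together on disk is what makes reading a few columns of many rows cheap, the access pattern that row-oriented RDBMS storage handles poorly.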
5.6 Document Databases: Compared with key-value stores, document stores support more complex data forms. Three important representatives of document storage systems are MongoDB, SimpleDB, and CouchDB. MongoDB stores documents as Binary JSON (BSON) objects; every document has an ID field as its primary key. SimpleDB is a distributed database offered as an Amazon web service; data in SimpleDB is organized into domains in which data may be stored, acquired, and queried, and domains include different properties and name/value pair sets of items. Apache CouchDB is a document-oriented database written in Erlang; data in CouchDB is organized into documents consisting of fields named by keys and their values, which are stored and accessed as JSON objects.
Recently, some parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap with relational databases. MapReduce [27] is a simple but powerful programming model for large-scale computing. It has two functions, Map and Reduce. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values associated with the same key and passes them to the Reduce function, which compresses the value set into a smaller set. The initial MapReduce framework did not support multiple datasets in a task, a limitation mitigated by recent enhancements [51].
To improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [52] by Google, Pig Latin [53] by Yahoo, Hive [54] by Facebook, and Scope [55] by Microsoft.
6. Big data analysis
Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions.
6.1. Traditional data analysis
Traditional data analysis approaches include cluster analysis, factor analysis, correlation analysis, regression analysis, A/B testing, statistical analysis, and data mining algorithms. Big data analysis approaches include Bloom filters, hashing, indexes, tries, and parallel computing.
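Of the approaches just listed, the Bloom filter is a compact example of trading accuracy for space at big-data scale. The sketch below (sizes and hash construction are illustrative choices, not from any cited system) uses a bit array plus k hash positions per item: membership tests may return false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: a bit array plus k hash positions per item."""

    def __init__(self, size=1024, k=3):
        self.size = size
        self.k = k
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash; real
        # implementations use cheaper hash families.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for url in ["a.com", "b.com"]:
    bf.add(url)
print("a.com" in bf)  # True: an added item is never reported absent
```

The false-positive rate is tuned by the bit-array size and k relative to the number of items, which is why Bloom filters suit duplicate detection over data streams too large to index exactly.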
6.2. Tools for big data mining and analysis
Major tools used for big data analysis and mining are R (30.7%), Excel (29.8%), RapidMiner (26.7%), KNIME (21.8%), and Weka/Pentaho (14.8%).
Figure 3 depicts the complete big data architecture. Key applications of big data are: big data in enterprises, IoT-based big data, online social network-oriented big data, healthcare and medical big data, collective intelligence, and the smart grid. Big data analysis can be classified into: structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis.
Figure 3: BigData architecture
7. Conclusion and Open Issues
The analysis of big data is confronted with many challenges, and current research is still at an early stage. Considerable research effort is needed to improve the efficiency of the display, storage, and analysis of big data. At present, many discussions of big data look more like commercial speculation than scientific research, because big data is not formally and structurally defined and the existing models are not strictly verified. An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Big data technology is still in its infancy. Many key
technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated. In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows rapidly, there are more severe safety risks, while traditional data protection methods have been shown not to apply to big data. In particular, big data safety is confronted with the following security-related challenges: big data privacy, data quality, big data safety mechanisms, and big data applications in information security. Big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
8. References
[1] Gantz J, Reinsel D, "Extracting value from chaos", IDC iView, pp 1-12 (2011).
[2] Danah boyd & Kate Crawford, "Critical Questions for Big Data", Routledge, Information, Communication & Society, Vol. 15, No. 5, June 2012, pp. 662-679, ISSN 1369-118 (2012).
[3] MongoDB, "Top 5 Considerations When Evaluating NoSQL Databases", White Paper.
[4] Moniruzzaman, A. B. M.; Hossain, Syed Akhter , “NoSQL Database: New Era of Databases for Big data
Analytics-Classification, Characteristics and Comparison”, Source: International Journal of Database Theory &
Application. Vol. 6 Issue 4, pp-13. (2013).
[5] Jongwook Woo, Siddharth Basopia, Yuhang Xu, “Market Basket Analysis Algorithm with no-SQL DB HBase
and Hadoop”, Computer Information Systems Department California State University, Los Angeles, CA, USA.
[6] Vatika Sharma, Meenu Dave, “SQL and NoSQL Databases”, International Journal of Advanced Research in
Computer Science and Software Engineering, Volume 2, Issue 8, August 2012 ISSN: 2277 128X.
[7] Kailash Raj Joshi , “GRAPH VISUALIZATION USING THE NoSQL DATABASE”, Paper Submitted to the
Graduate Faculty of the North Dakota State University of Agriculture and Applied Science.
[8] CouchBase, “NoSQL”, White Paper
[9] Arto Salminen, “Introduction to NoSQL”, NoSQL Seminar 2012 @ TUT.
[10] Big Data’s Impact in the World - NYTimes.com.
[11] Hsinchun Chen , Roger H. L , Veda C., “BUSINESS INTELLIGENCE AND ANALYTICS: FROM BIG
DATA TO BIG IMPACT”, MIS Quarterly Vol. 36 No. 4/December 2012,Eller College of Management, University
of Arizona, Tucson, AZ 85721 U.S.A.(2012).
[12] TATA Consultancy Services, “NoSQL, the Database for the Cloud”, White Paper.
[13] Darshana Shimpi, Sangita Chaudhari , “An overview of Graph Databases”, International Conference in Recent
Trends in Information Technology and Computer Science (ICRTITCS - 2012),
Proceedings published in International Journal of Computer Applications® (IJCA) (0975 – 8887)(2012).
[14] Renu Kanwar, Prakriti Trivedi, Kuldeep Singh , “NoSQL, a Solution for Distributed Database Management
System”, International Journal of Computer Applications (0975 – 8887) Volume 67– No.2, (2013).
[15] Mirko Kampf, Jan W. Kantelhardt , “Hadoop.TS: Large-Scale Time-Series Processing”, International Journal
of Computer Applications (0975 - 8887) Volume 74 - No. 17, (2013).
[16] David Navetta , “Legal Implications of BIG DATA - A Primer”, ISSA - Developing And Connecting
CyberSecurity Leaders Globally, ISSA member, Denver, USA Chapter.
[17] Bishop M, “Caution: Danger Ahead with Big Data”, ISSA DEVELOPING AND CONNECTING
CYBERSECURITY LEADERS GLOBALLY
[18] Scott J. Lusher, Ross McGuire, Rene van Schaik, David Nicholson and Jacob de Vlieg, "Data-driven medicinal chemistry in the era of big data", Elsevier, Drug Discovery Today (2013).
[19] Peter Tseng, Westbrook M. Weaver, Mahdokht Masaeli, Keegan Owsley and Dino Di Carlo, “ Research
highlights: microfluidics meets big data”, Royal Society of Chemistry - www.rsc.org/loc
[20] Nicholas P. Restifo, “Big Data” View of the Tumor ‘‘Immunome’’”, Immunity Previews - Cell Press, Center
for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
[21] Marton Mestyan, Taha Yasseri, Janos Kertesz, "Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data", PLOS.
[22] Wenliang Huang, Zhen Chen, Wenyu Dong, Hang Li, Bin Cao, Junwei Cao, “Mobile Internet Big Data Platform in China Unicom”, Tsinghua Science and Technology, ISSN 1007-0214, Volume 19, Number 1, pp 95–101 (2014).
[23] Min Chen, Shiwen Mao and Yunhao Liu, “Big Data: A Survey”, Springer, School of Computer Science and
Technology, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan, 430074, China. (2014).
[24] Raghupathi W, Raghupathi V, “Big data analytics in healthcare: promise and potential”, Health Information Science and Systems, http://www.hissjournal.com/content/2/1/3 (2014).
[25] Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH, “Big data: the next frontier for innovation, competition, and productivity”, McKinsey Global Institute (2011).
[26] Ghemawat S, Gobioff H, Leung S-T, “The Google file system”, In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43 (2003).
[27] Dean J, Ghemawat S, “MapReduce: simplified data processing on large clusters”, Commun ACM 51(1):107–113 (2008).
[28] Singla A, Singh A, Ramachandran K, Xu L, Zhang Y, “Proteus: a topology malleable data center network”, In: Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks. ACM, p 8 (2010).
[29] Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P, “Energy-efficient design of a scalable optical multiplane interconnection architecture”, IEEE J Sel Top Quantum Electron 17(2):377–383 (2011).
[30] Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H, “Mirror mirror on the ceiling: flexible wireless links for data centers”, ACM SIGCOMM Comput Commun Rev 42(4):443–454 (2012).
[31] Cafarella MJ, Halevy A, Khoussainova N, “Data integration for the relational web”, Proc VLDB Endowment 2(1):1090–1101 (2009).
[32] Maletic JI, Marcus A, “Data cleansing: beyond integrity analysis”, In: IQ. Citeseer, pp 200–209 (2000).
[33] Kohavi R, Mason L, Parekh R, Zheng Z, “Lessons and challenges from mining retail e-commerce data”, Mach Learn 57(1-2):83–113 (2004).
[34] Chen H, Ku W-S, Wang H, Sun M-T, “Leveraging spatiotemporal redundancy for RFID data cleansing”, In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, pp 51–62 (2010).
[35] Zhao Z, Ng W, “A model-based approach for RFID data stream cleansing”, In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, pp 862–871 (2012).
[36] Khoussainova N, Balazinska M, Suciu D, “Probabilistic event extraction from RFID data”, In: Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE 2008). IEEE, pp 1480–1482 (2008).
[37] Kamath U, Compton J, Dogan RI, Jong KD, Shehu A, “An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction”, IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398 (2012).
[38] Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y, “Data mining on DNA sequences of hepatitis B virus”, IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440 (2011).
[39] Huang Z, Shen H, Liu J, Zhou X, “Effective data co-reduction for multimedia similarity search”, In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, pp 1021–1032 (2011).
[40] Bleiholder J, Naumann F, “Data fusion”, ACM Comput Surv (CSUR) 41(1):1 (2008).
[41] Brewer EA, “Towards robust distributed systems”, In: PODC, p 7 (2000).
[42] Gilbert S, Lynch N, “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services”, ACM SIGACT News 33(2):51–59 (2002).
[43] McKusick MK, Quinlan S, “GFS: evolution on fast-forward”, ACM Queue 7(7):10 (2009).
[44] Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J, “SCOPE: easy and efficient parallel processing of massive data sets”, Proc VLDB Endowment 1(2):1265–1276 (2008).
[45] Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, “Finding a needle in Haystack: Facebook’s photo storage”, In: OSDI, vol 10, pp 1–8 (2010).
[46] DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W, “Dynamo: Amazon’s highly available key-value store”, In: SOSP, vol 7, pp 205–220 (2007).
[47] Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D, “Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web”, In: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. ACM, pp 654–663 (1997).
[48] Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE,
“Bigtable: a distributed storage system for structured data”, ACM Trans Comput Syst (TOCS) 26(2):4(2008).
[49] Burrows M, “The Chubby lock service for loosely-coupled distributed systems”, In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation. USENIX Association, pp 335–350 (2006).
[50] Lakshman A, Malik P, “Cassandra: structured storage system on a P2P network”, In: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing. ACM, pp 5–5 (2009).
[51] Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y, “A comparison of join algorithms for log processing in MapReduce”, In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, pp 975–986 (2010).
[52] Pike R, Dorward S, Griesemer R, Quinlan S, “Interpreting the data: parallel analysis with Sawzall”, Sci Program 13(4):277–298 (2005).
[53] Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U, “Building a high-level dataflow system on top of Map-Reduce: the Pig experience”, Proc VLDB Endowment 2(2):1414–1425 (2009).
[54] Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R, “Hive: a
warehousing solution over a map-reduce framework”, Proc VLDB Endowment 2(2):1626–1629 (2009).
[55] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D, “Dryad: distributed data-parallel programs from sequential
building blocks”, ACM SIGOPS Oper Syst Rev 41(3):59–72 (2007).