1. INTRODUCTION

1.1 About BigData:

Big data is a buzzword, or catch-phrase, used to describe a volume of structured and unstructured data so large that it is difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big, moves too fast, or exceeds current processing capacity.

While the term may seem to reference the volume of data, that isn’t always the case. The term big data, especially when used by vendors, may refer to the technology (including the tools and processes) that an organization requires to handle large amounts of data and the associated storage facilities.

The term Big Data is believed to have originated with the web search companies, who had to query very large, distributed aggregations of loosely structured data. Big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. For leading corporations such as Walmart or Google, this power has been within reach for some time, but at fantastic cost. Today’s commodity hardware, cloud architectures and open-source software bring big data processing within the reach of the less well-resourced. Big data processing is eminently feasible even for small garage startups, which can cheaply rent server time in the cloud.

1.2 An Example of Big Data:

An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records about millions of people, all drawn from different sources (e.g. the Web, sales, customer contact centers, social media, mobile data and so on). The data is typically loosely structured, and often incomplete and inaccessible.

When dealing with datasets of this size, organizations face difficulties in creating, manipulating, and managing big data. Big data is a particular problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.

1.3 SIZE OF BIGDATA:

Social networking sites need to process data of enormous size on a daily basis; Facebook, for example, ingests on the order of 500 terabytes of new data every day.

1.4 CHARACTERISTICS OF BIG DATA

The characteristics of big data are as follows:

Volume.

A typical PC might have had 10 gigabytes of storage in 2000. Today, Facebook ingests 500 terabytes of new data every day, and a Boeing 737 generates 240 terabytes of flight data during a single flight across the US. The proliferation of smart phones and the data they create and consume, together with sensors embedded in everyday objects, will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

Velocity.

Click streams and ad impressions capture user behavior at millions of events per second; high-frequency stock trading algorithms reflect market changes within microseconds; machine-to-machine processes exchange data between billions of devices; infrastructure and sensors generate massive log data in real time; online gaming systems support millions of concurrent users, each producing multiple inputs per second.

Variety.

Big Data isn’t just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media. Traditional database systems were designed to address smaller volumes of structured data, fewer updates and a predictable, consistent data structure. They were also designed to operate on a single server, making increased capacity expensive and finite. As applications have evolved to serve large volumes of users, and as application development practices have become agile, the traditional use of the relational database has become a liability for many companies rather than an enabling factor in their business. Big Data databases, such as MongoDB, solve these problems and provide companies with the means to create tremendous business value.
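To make the flexible-schema point concrete, the sketch below stores heterogeneous records in a single MongoDB collection using the pymongo driver. It assumes a local MongoDB instance; the database and collection names and the sample documents are purely illustrative.

```python
# Minimal sketch, assuming a MongoDB instance on localhost and pymongo installed.
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["demo_db"]["events"]  # illustrative names

# Documents in the same collection can carry different fields; no upfront schema is required.
events.insert_one({"type": "tweet",  "user": "alice", "text": "big data!", "geo": [51.5, -0.1]})
events.insert_one({"type": "sensor", "device_id": 42, "temperature_c": 21.7})
events.insert_one({"type": "video",  "url": "http://example.com/clip.mp4", "duration_s": 95})

# Query on whatever fields a given document type happens to have.
for doc in events.find({"type": "sensor"}):
    print(doc["device_id"], doc["temperature_c"])
```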

2. BIGDATA ANALYTICS

Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue.

2.1 GOAL OF BIGDATA ANALYTICS:

The primary goal of big data analytics is to help companies make better business decisions by enabling data scientists and other users to analyze huge volumes of transaction data as well as other data sources that may be left untapped by conventional business intelligence (BI) programs. These other data sources may include Web server logs, social media activity reports, mobile-phone call detail records and information captured by sensors. Some people exclusively associate big data and big data analytics with unstructured data of that sort, but consulting firms like Gartner Inc. and Forrester Research Inc. also consider transactions and other structured data to be valid forms of big data.

2.2 TECHNOLOGIES ASSOCIATED WITH BIGDATA:

Big data analytics can be done with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics and data mining. But the unstructured data sources used for big data analytics may not fit in traditional data warehouses. Furthermore, traditional data warehouses may not be able to handle the processing demands posed by big data. As a result, a new class of big data technology has emerged and is being used in many big data analytics environments. The technologies associated with big data analytics include NoSQL databases, Hadoop and MapReduce. Hadoop and MapReduce form the core of an open-source software framework that supports the processing of large data sets across clustered systems, while NoSQL databases provide scalable storage for the data being analyzed.
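The MapReduce model itself is simple: a map function turns each input record into key-value pairs, the framework groups the pairs by key, and a reduce function combines the values for each key. The single-process sketch below illustrates the idea with a word count; a real Hadoop job runs the same map and reduce logic in parallel across a cluster, and all names here are illustrative.

```python
# Minimal single-process sketch of the MapReduce idea (word count); not a Hadoop API.
from collections import defaultdict

def map_phase(document):
    # Emit one (word, 1) pair per word in the input record.
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(key, values):
    # Combine all values observed for one key into the final result.
    return key, sum(values)

documents = ["big data is big", "data about data"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```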

2.3 Challenges in Big Data Analysis

Heterogeneity and Incompleteness: When humans consume information, a great deal of heterogeneity is comfortably tolerated. In fact, the nuance and richness of natural language can provide valuable depth. However, machine analysis algorithms expect homogeneous data and cannot understand nuance. In consequence, data must be carefully structured as a first step in (or prior to) data analysis.

Even after data cleaning and error correction, some incompleteness and some errors in the data are likely to remain. This incompleteness and these errors must be managed during data analysis. Doing this correctly is a challenge.

Scale: Of course, the first thing anyone thinks of with Big Data is its size. After all, the word “big” is there in the very name. Managing large and rapidly increasing volumes of data has been a challenging issue for many decades.

Timeliness: The larger the data set to be processed, the longer it will take to analyze. The design of a system that effectively deals with size is likely also to result in a system that can process a given size of data set faster. However, many analyses are useful only if their results arrive within a deadline, for example when flagging a suspicious transaction before it completes, so acquiring and analyzing data quickly enough is a challenge in its own right.

Privacy: The privacy of data is another huge concern, and one that increases in the context of Big Data. For electronic health records, there are strict laws governing what can and cannot be done. However, there is great public fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is effectively both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of big data.

2.4 USES OF BIGDATA ANALYTICS:

- Enable data analysts to rapidly produce insights with no IT involvement
- Connect to any data source of any data type using a simple, guided interface
- Utilize 220+ built-in functions to quickly and easily analyze your data
- Create scenarios such as market compensation comparison, global payroll cost analysis, retention risk and impact analysis, competitive benchmark, revenue pulse, and more

3. BIG DATA TECHNOLOGY

3.1 Selecting a Big Data Technology: Operational vs. Analytical

The Big Data landscape is dominated by two classes of technology: systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored, and systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data. These classes of technology are complementary and frequently deployed together.

Operational and analytical workloads for Big Data present opposing requirements, and systems have evolved to address their particular demands separately and in very different ways. Each has driven the creation of new technology architectures. Operational systems, such as the NoSQL databases, focus on servicing highly concurrent requests while exhibiting low latency for responses operating on highly selective access criteria. Analytical systems, on the other hand, tend to focus on high throughput; queries can be very complex and touch most if not all of the data in the system at any time. Both kinds of system tend to operate over many servers in a cluster, managing tens or hundreds of terabytes of data across billions of records.

OPERATIONAL BIG DATA

For operational Big Data workloads, NoSQL Big Data systems such as document databases have emerged to address a broad set of applications, while other architectures, such as key-value stores, column family stores, and graph databases, are optimized for more specific applications. NoSQL technologies, which were developed to address the shortcomings of relational databases in the modern computing environment, are faster and scale much more quickly and inexpensively than relational databases.

Critically, NoSQL Big Data systems are designed to take advantage of the new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational Big Data workloads much easier to manage, and cheaper and faster to implement.

In addition to user interactions with data, most operational systems need to provide some degree of real-time intelligence about the active data in the system. For example, in a multi-user game or financial application, aggregates for user activities or instrument performance are displayed to users to inform their next actions. Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.
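As a rough illustration of this kind of selective operational query and lightweight real-time aggregate, the sketch below uses pymongo against a hypothetical collection of player actions; the deployment details, collection name and fields are assumptions, not part of the source.

```python
# Minimal sketch, assuming MongoDB on localhost with pymongo; names and fields are illustrative.
from pymongo import MongoClient

actions = MongoClient("mongodb://localhost:27017")["game"]["player_actions"]

# Selective, low-latency operational read: one player's most recent actions.
recent = actions.find({"player_id": 1001}).sort("ts", -1).limit(10)

# Simple real-time aggregate: total actions per player, e.g. to drive a leaderboard display.
leaderboard = actions.aggregate([
    {"$group": {"_id": "$player_id", "actions": {"$sum": 1}}},
    {"$sort":  {"actions": -1}},
    {"$limit": 5},
])
for row in leaderboard:
    print(row["_id"], row["actions"])
```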

ANALYTICAL BIG DATA

Analytical Big Data workloads, on the other hand, tend to be addressed by MPP database systems and MapReduce. These technologies are also a reaction to the limitations of traditional relational databases and their inability to scale beyond the resources of a single server. Furthermore, MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL.

As applications gain traction and their users generate increasing volumes of data, there are a number of retrospective analytical workloads that provide real value to the business. Where these workloads involve algorithms that are more sophisticated than simple aggregation, MapReduce has emerged as the first choice for Big Data analytics. Some NoSQL systems provide native MapReduce functionality that allows analytics to be performed on operational data in place. Alternately, data can be copied from NoSQL systems into analytical systems such as Hadoop for MapReduce.

OVERVIEW OF OPERATIONAL VS. ANALYTICAL SYSTEMS

                  Operational          Analytical
Latency           1 ms - 100 ms        1 min - 100 min
Concurrency       1,000 - 100,000      1 - 10
Access Pattern    Writes and Reads     Reads
Queries           Selective            Unselective
Data Scope        Operational          Retrospective
End User          Customer             Data Scientist
Technology        NoSQL                MapReduce, MPP Database

Combining Operational and Analytical Technologies; Using Hadoop

New technologies such as NoSQL, MPP databases, and Hadoop have emerged to address Big Data challenges and to enable new types of products and services to be delivered by the business. One of the most common ways companies leverage the capabilities of both kinds of system is by integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made via existing APIs and allows analysts and data scientists to perform complex, retrospective queries for Big Data analysis and insight while maintaining the efficiency and ease of use of a NoSQL database.
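The simplest form of such an integration is just a data flow: periodically export operational documents from the NoSQL store into files that a batch system can process. The sketch below does this with pymongo and newline-delimited JSON; the database, collection and file names are assumptions, and the official MongoDB-Hadoop connector integrates at a lower level rather than through flat files.

```python
# Minimal data-flow sketch only: export MongoDB documents as newline-delimited JSON
# so a batch job (e.g. Hadoop Streaming) can consume them. Names are illustrative.
import json
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

with open("orders.jsonl", "w") as out:
    for doc in orders.find({}, {"_id": 0}):   # drop the ObjectId so plain JSON suffices
        out.write(json.dumps(doc, default=str) + "\n")

# orders.jsonl can now be copied into HDFS and analyzed with a MapReduce job.
```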

NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to capture Big Data and provide operational intelligence to users, while MPP databases and Hadoop should be used to provide analytical insight for analysts and data scientists. Together, NoSQL, MPP databases and Hadoop enable businesses to capitalize on Big Data.

3.2 Considerations for Decision Makers

While many Big Data technologies are mature enough to be used for mission-critical, production use cases, the field is still nascent in some regards. Accordingly, the way forward is not always clear. As organizations develop Big Data strategies, there are a number of dimensions to consider when selecting technology partners, including:

1. Online vs. Offline Big Data
2. Software Licensing Models
3. Community
4. Developer Appeal
5. Agility
6. General Purpose vs. Niche Solutions

1. ONLINE VS. OFFLINE BIG DATA

Big Data can take both online and offline forms. Online Big Data refers to data that is created, ingested, transformed, managed and/or analyzed in real time to support operational applications and their users. Big Data is born online. Latency for these applications must be very low and availability must be high in order to meet SLAs and user expectations for modern application performance. This includes a vast array of applications, from social networking news feeds to analytics to real-time ad servers to complex CRM applications. Examples of online Big Data databases include MongoDB and other NoSQL databases.

Offline Big Data encompasses applications that ingest, transform, manage and/or analyze Big Data in a batch context. They typically do not create new data. For these applications, response time can be slow (up to hours or days), which is often acceptable for this type of use case. Since they usually produce a static (vs. operational) output, such as a report or dashboard, they can even go offline temporarily without impacting the overall goal or end product. Examples of offline Big Data applications include Hadoop-based workloads; modern data warehouses; extract, transform, load (ETL) applications; and business intelligence tools.

Organizations evaluating which Big Data technologies to adopt should consider how they intend to use their data. Those looking to build applications that support real-time, operational use cases will need an operational data store like MongoDB. Those that need a place to conduct long-running analysis offline, perhaps to inform decision-making processes, will find offline solutions like Hadoop an effective tool. Organizations pursuing both use cases can do so in tandem, and they will sometimes find integrations between online and offline Big Data technologies. For instance, MongoDB provides integration with Hadoop.

2. SOFTWARE LICENSE MODEL

There are three general types of licenses for Big Data software technologies:

Proprietary. The software product is owned and controlled by a software company. The source code is not available to licensees. Customers typically license the product through a perpetual license that entitles them to indefinite use, with annual maintenance fees for support and software upgrades. Examples of this model include databases from Oracle, IBM and Teradata.

Open-Source. The software product and source code are freely available to use. Companies monetize the software product by selling subscriptions and adjacent products with value-added components, such as management tools and support services. Examples of this model include MongoDB (by MongoDB, Inc.) and Hadoop (by Cloudera and others).

Cloud Service. The service is hosted in a cloud-based environment outside of customers’ data centers and delivered over the public Internet. The predominant business model is metered (i.e., pay-per-use) or subscription-based. Examples of this model include Google App Engine and Amazon Elastic MapReduce.

For many Fortune 1000 companies, regulations and internal policies around data privacy limit their ability to leverage cloud-based solutions. As a result, most Big Data initiatives are driven with technologies deployed on-premise. Most of the Big Data pioneers are web companies that developed powerful software and hardware, which they open-sourced to the larger community. Accordingly, most of the software used for Big Data projects is open-source.

3. COMMUNITY

In these early days of Big Data, there is an opportunity to learn from others. Organizations should consider how many other initiatives are being pursued using the same technologies and with similar objectives. To understand a given technology’s adoption, organizations should consider the following:

- The number of users
- The prevalence of local, community-organized events
- The health and activity of online forums such as Google Groups and StackOverflow
- The availability of conferences, how frequently they occur and whether they are well-attended

4. DEVELOPER APPEAL

The market for Big Data talent is tight. The nation’s top engineers and data scientists often flock to companies like Google and Facebook, which are known havens for the brightest minds and places where one will be exposed to leading-edge technology. If enterprises want to compete for this talent, they have to offer more than money.

By offering developers the opportunity to work on tough problems, and by using a technology that has strong developer interest, a vibrant community, and an auspicious long-term future, organizations can attract the brightest minds. They can also increase the pool of candidates by choosing technologies that are easy to learn and use, which are often the ones that appeal most to developers. Furthermore, technologies that have strong developer appeal tend to make for more productive teams who feel they are empowered by their tools rather than encumbered by poorly designed, legacy technology. Productive developer teams reduce time to market for new initiatives and reduce development costs as well.

5. AGILITY

Organizations should use Big Data products that enable them to be agile. They will benefit from technologies that get out of the way and allow teams to focus on what they can do with their data, rather than on how to deploy new applications and infrastructure. This will make it easy to explore a variety of paths and hypotheses for extracting value from the data and to iterate quickly in response to changing business needs.

In this context, agility comprises three primary components:

Ease of Use. A technology that is easy for developers to learn and understand, whether because of the way it is architected, the availability of tools and information, or both, will enable teams to get Big Data projects started and to realize value quickly. Technologies with steep learning curves and fewer resources to support education will make for a longer road to project execution.

Technological Flexibility. The product should make it relatively easy to change requirements on the fly, such as how data is modeled, which data is used, where data is pulled from and how it gets processed, as teams develop new findings and adapt to internal and external needs. Dynamic data models (also known as schemas) and scalability are capabilities to seek out.

Licensing Freedom. Open-source products are typically easier to adopt, as teams can get started quickly with free community versions of the software. They are also usually easier to scale from a licensing standpoint, as teams can buy more licenses as requirements increase. By contrast, in many cases proprietary software vendors require large, upfront license purchases, which make it harder for teams to get moving quickly and to scale in the future.

MongoDB’s ease of use, dynamic data model and open-source licensing model make it the most agile Big Data solution available.

6. GENERAL PURPOSE VS. NICHE SOLUTIONS

Organizations are constantly trying to standardize on fewer technologies to reduce complexity, to improve their competency in the selected tools and to make their vendor relationships more productive. Organizations should consider whether adopting a Big Data technology helps them address a single initiative or many initiatives. If the technology is general purpose, the expertise, infrastructure, skills, integrations and other investments of the initial project can be amortized across many projects. Organizations may find that a niche technology is a better fit for a single project, but that a more general purpose tool is the better option for the organization as a whole.

4. ADVANTAGES OF BIGDATA

The practical advantages of big data are as follows:

Dialogue with consumers

Today’s consumers are a tough nut to crack. They look around a lot before they buy, talk to their entire social network about their purchases, demand to be treated as unique and want to be sincerely thanked for buying your products. Big Data allows you to profile these increasingly vocal and fickle little ‘tyrants’ in a far-reaching manner so that you can engage in an almost one-on-one, real-time conversation with them. This is not actually a luxury: if you don’t treat them like they want to, they will leave you in the blink of an eye.

Just a small example: when a customer enters a bank, Big Data tools allow the clerk to check that customer’s profile in real time and learn which relevant products or services to advise. Big Data will also have a key role to play in uniting the digital and physical shopping spheres: a retailer could push an offer to a consumer’s mobile device on the basis of that consumer indicating a certain need on social media.

Re-develop your products

Big Data can also help you understand how others perceive your products so that you can adapt them, or your marketing, if need be. Analysis of unstructured social media text allows you to uncover the sentiments of your customers and even segment those in different geographical locations or among different demographic groups.

On top of that, Big Data lets you test thousands of different variations of computer-aided designs in the blink of an eye so that you can check how minor changes in, for instance, material affect costs, lead times and performance. You can then raise the efficiency of the production process accordingly.

Perform risk analysis

Success not only depends on how you run your company; social and economic factors are crucial for your accomplishments as well. Predictive analytics, fueled by Big Data, allows you to scan and analyze newspaper reports or social media feeds so that you permanently keep up to speed on the latest developments in your industry and its environment. Detailed health checks on your suppliers and customers are another benefit that comes with Big Data, allowing you to take action when one of them is at risk of defaulting.

Keeping your data safe

You can map the entire data landscape across your company with Big Data tools, allowing you to analyze the threats that you face internally. You will be able to detect potentially sensitive information that is not protected in an appropriate manner and make sure it is stored according to regulatory requirements. With real-time Big Data analytics you can, for example, flag any situation where 16-digit numbers (potentially credit card data) are stored or emailed out, and investigate accordingly.
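As a rough illustration of that last point, the sketch below scans a piece of outbound text for 16-digit sequences and applies a Luhn checksum to cut down on false positives. The pattern, helper names and sample strings are illustrative only; real data-loss-prevention tooling is considerably more thorough.

```python
# Minimal sketch: flag 16-digit sequences (possible credit card numbers) in outbound text.
import re

CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")  # 16 digits with optional spaces/dashes

def luhn_ok(candidate: str) -> bool:
    # Luhn checksum: filters out most random 16-digit strings that are not card numbers.
    digits = [int(d) for d in re.sub(r"\D", "", candidate)][::-1]
    total = sum(d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
                for i, d in enumerate(digits))
    return total % 10 == 0

def flag_possible_cards(text: str):
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_ok(m.group())]

print(flag_possible_cards("Order ref 4111 1111 1111 1111 shipped; invoice 1234567890123456."))
# ['4111 1111 1111 1111']  (the second number fails the Luhn check)
```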

Create new revenue streams

The insights that you gain from analyzing your market and its consumers with Big Data are not just valuable to you. You could sell them as non-personalized trend data to large industry players operating in the same segment as you and create a whole new revenue stream.

One of the more impressive examples comes from Shazam, the song identification application. It helps record labels find out where music sub-cultures are arising by monitoring the use of its service, including the location data that mobile devices so conveniently provide. The record labels can then find and sign up promising new artists, or remarket their existing ones, accordingly.

Customize your website in real time

Big Data analytics allows you to personalize the content or the look and feel of your website in real time to suit each consumer entering it, depending on, for instance, their sex, nationality or where they arrived at your site from. The best-known example is probably offering tailored recommendations: Amazon’s use of real-time, item-based collaborative filtering (IBCF) to fuel its ‘Frequently bought together’ and ‘Customers who bought this item also bought’ features, or LinkedIn suggesting ‘People you may know’ or ‘Companies you may want to follow’. And the approach works: Amazon generates about 20% more revenue via this method.
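Item-based collaborative filtering boils down to a simple idea: two items are similar if largely the same users interacted with both, so an item page can recommend the most similar other items. The sketch below computes that similarity with a cosine measure over purchase sets; the data and function names are invented for illustration and are nothing like a production recommender.

```python
# Minimal item-based collaborative filtering sketch; data and names are illustrative only.
from math import sqrt

baskets = {                      # user -> set of purchased items
    "u1": {"book", "lamp"},
    "u2": {"book", "lamp", "desk"},
    "u3": {"book", "desk"},
    "u4": {"lamp"},
}

def users_of(item):
    return {user for user, items in baskets.items() if item in items}

def cosine(item_a, item_b):
    # Similarity of two items based on the overlap of the users who bought them.
    a, b = users_of(item_a), users_of(item_b)
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

def also_bought(item, k=2):
    # The k items most similar to the one currently being viewed.
    others = {i for items in baskets.values() for i in items} - {item}
    return sorted(others, key=lambda other: cosine(item, other), reverse=True)[:k]

print(also_bought("book"))  # ['desk', 'lamp'] for this toy data
```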

Reducing maintenance costs

Traditionally, factories estimate that a certain type of equipment is likely to wear out after a certain number of years. Consequently, they replace every piece of that equipment within that period, even devices that have much more useful life left in them. Big Data tools do away with such impractical and costly averages. The massive amounts of data they access and their unequalled speed can spot failing grid devices and predict when they will give out. The result: a much more cost-effective replacement strategy for the utility and less downtime, as faulty devices are tracked a lot faster.

Offering tailored healthcare

We are living in a hyper-personalized world, but healthcare seems to be one of the last sectors still using generalized approaches. When someone is diagnosed with cancer they usually undergo one therapy, and if that doesn’t work, the doctors try another, and so on. But what if a cancer patient could receive medication that is tailored to his or her individual genes? This would result in a better outcome, less cost, less frustration and less fear.

With human genome mapping and Big Data tools, it will soon be commonplace for everyone to have their genes mapped as part of their medical record. This brings medicine closer than ever to finding the genetic determinants that cause a disease and developing drugs expressly tailored to treat those causes: in other words, personalized medicine.

Offering enterprise-wide insights

Previously, if business users needed to analyze large amounts of varied data, they had to ask their IT colleagues for help, as they themselves lacked the technical skills to do so. Often, by the time they received the requested information, it was no longer useful or even correct. With Big Data tools, the technical teams can do the groundwork and then build repeatability into algorithms for faster searches. In other words, they can develop systems and install interactive and dynamic visualization tools that allow business users to analyze, view and benefit from the data.

Making our cities smarter

To help them deal with the consequences of their fast expansion, an increasing number of smart cities are leveraging Big Data tools for the benefit of their citizens and the environment. The city of Oslo in Norway, for instance, reduced street lighting energy consumption by 62% with a smart solution. Since the Memphis Police Department started using predictive software in 2006, it has been able to reduce serious crime by 30%. The city of Portland, Oregon, used technology to optimize the timing of its traffic signals and was able to eliminate more than 157,000 metric tons of CO2 emissions in just six years, the equivalent of taking 30,000 passenger vehicles off the roads for an entire year.

These are a few practical examples of Big Data in use.

5. RISKS OF BIGDATA

Big data has gotten a lot of press recently, and rightly so. With the vast amounts of data now available, we can do more than could have been imagined in previous decades. But there is another face to big data: companies now have to manage some very big risks.

It’s hard to visualize the amount of data we’re talking about. As one article put it, “In 2011 alone, 1.8 zettabytes (or 1.8 trillion gigabytes) of data will be created, the equivalent to every U.S. citizen writing 3 tweets per minute for 26,976 years.” And this volume is anticipated to grow roughly 50-fold by the year 2020.

Risk #1: Loss of agility

In a typical large-scale organization, data is housed on multiple platforms. There is transactional data, email data, analytics data, and so on. Management wants people to be able to locate, analyze, and make decisions based on this data quickly; that is a necessity in today’s marketplace, where conditions can change in an instant. But if the data isn’t evaluated, organized, and stored properly, critical information can be either difficult or impossible to find, slowing a business down at the exact moment when speed is essential.

Risk #2: Loss of compliance

Laws are getting more and more complex with regard to how long companies need to retain data, how they need to retain it, and where they need to retain it. There are both general regulations in place as well as state- or industry-specific regulations that may apply. It is not uncommon for regulators to perform random audits to examine a company’s policies regarding data and its actual management of that data. A compliance failure can result in significant fines or reputational damage.

Risk #3: Loss of security

With more data located in and moving between more places than ever before, there are also vastly more ways to hack into that data. A security breach can result in theft, fraud, fines and, of course, reputational loss. No company wants to be featured on the front page of the Wall Street Journal because it has been hacked.

Risk #4: Loss of money

As the amount of data grows, it is all too tempting to simply throw more servers at the problem. After all, storage is cheap, isn’t it? But consider one example: a client of SunGard Availability Services believed it needed an entire new data center to house its data. Studies showed that not only did it not need a new data center; it actually needed only half its current storage, because it simply wasn’t managing its data well. A server may seem inexpensive at first glance, but never assume that storage is cheap.

Big data is a good thing, no question about it. But big risky data is a bad thing. Companies today need to manage their data to minimize their risk. This involves having policies that are in compliance with regulatory standards, processes that cover all contingencies, retention schedules that are up to date, and consistent self-evaluation to determine what data is necessary for the proper functioning of the company.

The more efficiently companies store, manage, and host their data, the more agile, compliant, secure, and cost-effective they will be. And that will take the big risk out of big data.

6. CONCLUSION

We have entered an era of Big Data. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and for improving the profitability and success of many enterprises. However, many technical challenges described in this paper must be addressed before this potential can be realized fully. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains, and are therefore not cost-effective to address in the context of one domain alone. Furthermore, these challenges will require transformative solutions, and will not be addressed naturally by the next generation of industrial products. We must support and encourage fundamental research towards addressing these technical challenges if we are to achieve the promised benefits of Big Data.
