BIG DATA SOURCEBOOK
Second Edition
WWW.DBTA.COM
From the publishers of Database Trends and Applications
CONTENTS
BIG DATA SOURCEBOOK | DECEMBER 2014

introduction
The Big Data Frontier
Joyce Wells

industry updates
How Businesses Are Driving Big Data Transformation
John O'Brien
The Enabling Force Behind Digital Enterprises
Joe McKendrick
Data Integration Evolves to Support a Bigger Analytic Vision
Stephen Swoyer
Turning Data Into Value Using Analytics
Bart Baesens
As Clouds Roll In, Expectations for Performance and Availability Billow
Michael Corey, Don Sullivan
Social Media Analytics Tools and Platforms: The Need for Speed
Peter J. Auditore
The Big Data Challenge to Data Quality
Elliot King
Building the Unstructured Big Data/Data Warehouse Interface
W. H. Inmon
Big Data Poses Security Risks
Geoff Keston
BIG DATA SOURCEBOOK is published annually by Information Today, Inc.,
143 Old Marlton Pike, Medford, NJ 08055
POSTMASTER
Send all address changes to: Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055
Copyright 2014, Information Today, Inc. All rights reserved.
PRINTED IN THE UNITED STATES OF AMERICA
The Big Data Sourcebook is a resource for IT managers and professionals providing information on the enterprise and technology issues surrounding the big data phenomenon and the need to better manage and extract value from large quantities of structured, unstructured, and semi-structured data. The Big Data Sourcebook provides in-depth articles on the expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as new capabilities for traditional data management systems. Articles cover business- and technology-related topics, including business intelligence and advanced analytics, data security and governance, data integration, data quality and master data management, social media analytics, and data warehousing.
No part of this magazine may be reproduced by any means, print, electronic, or any other, without written permission of the publisher.
COPYRIGHT INFORMATION
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC's Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS.
Creation of derivative works, such as informative abstracts, unless agreed to in writing by thecopyright owner, is forbidden.
Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook. Big Data Sourcebook disclaims responsibility for the statements, either of fact or opinion, advanced by the contributors and/or authors.
The views in this publication are those of the authors and do not necessarily reflect the views of Information Today, Inc. (ITI) or the editors.
© 2014 Information Today, Inc.
PUBLISHED BY Unisphere Media, a Division of Information Today, Inc.
EDITORIAL & SALES OFFICE 630 Central Avenue, Murray Hill, New Providence, NJ 07974
CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055
Thomas Hogan Jr., Group Publisher, 609-654-6266; thoganjr@infotoday
Joyce Wells, Managing Editor, 908-795-3704
Joseph McKendrick, Contributing Editor
Alexis Sopko, Advertising Coordinator, 908-795-3703
Adam Shepherd, Editorial and Advertising Assistant, 908-795-3705
Celeste Peterson-Sloss, Lauree Padgett, Alison A. Trotta, Editorial Services
Norma Neimeister, Production Manager
Denise M. Erickson, Senior Graphic Designer
Jackie Crawford, Ad Trafficking Coordinator
Sheila Willison, Marketing Manager, Events and Circulation, 859-278-2223
DawnEl Harris, Director of Web Events
ADVERTISING
Stephen Faig, Business Development Manager, 908-795-3702
INFORMATION TODAY, INC. EXECUTIVE MANAGEMENT
Thomas H. Hogan, President and CEO
Roger R. Bilboul, Chairman of the Board
John C. Yersak, Vice President and CAO
Thomas Hogan Jr., Vice President, Marketing and Business Development
Richard T. Kaser, Vice President, Content
Bill Spence, Vice President, Information Technology
The Big Data Frontier
By Joyce Wells
Today, big data, cloud, mobility, and the proliferation of connected devices, coupled with newer data
management approaches, such as Hadoop, NoSQL, and
in-memory systems, are increasing the opportunities for
enterprises to harness data. However, with this new fron-
tier there are challenges to be overcome. As they work to
maintain legacy applications and systems, IT organiza-
tions must address new demands for more timely access
to more data from more users, in addition to maintaining continuous availability of IT systems and enforcing
appropriate data governance.
It's a lot to think about. How can companies choose the
right approach to leverage big data while keeping newer
technologies in line with budgetary, application availabil-
ity, and security concerns?
Over the past year, Unisphere Research, a division of
Information Today, Inc., has conducted surveys among IT
professionals to gain insight into the challenges organiza-
tions are facing.
The information overload is already taking its toll on
IT organizations and professionals. According to a Uni-
sphere Research report, "Governance Moves Big Data
From Hype to Confidence," the percentage of organiza-
tions with big data projects is expected to triple by the
end of 2015. However, while organizations are investing
in increasing the information at their disposal, they are
finding that they are committing more time to simply
locating the necessary data, as opposed to actually ana-
lyzing it. In addition, the report, based on a survey of 304
data management professionals and sponsored by IBM,
found that respondents tend to be less confident about
data gathered through social media and public cloud
applications.
With all this data, there are also concerns about the ability to maintain the high availability mandated by
today's stringent service-level agreements. According to
another Unisphere Research survey sponsored by EMC,
and conducted among 315 members of the Indepen-
dent Oracle Users Group (IOUG), close to one-fourth
of respondents' organizations have SLAs of four nines of
availability or greater, meaning that they can have only 52
minutes or less of downtime a year. The survey, "Bringing
Continuous Availability to Oracle Environments," found
that more than 25% of respondents dealt with more than
8 hours of unplanned downtime during the previous
year, which they attributed to network outages, server
failures, storage failures, human error, and power outages.
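For readers who want to check the arithmetic behind those figures, here is a quick Python sketch; the availability levels are the standard "nines" shorthand, not numbers taken from the survey.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget(nines):
    """Allowed downtime per year, in minutes, for an SLA with the given number of nines."""
    availability = 1 - 10 ** -nines   # e.g., 4 nines -> 0.9999
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines: about {downtime_budget(n):.1f} minutes of downtime per year")
# Four nines works out to roughly 52.6 minutes, the ceiling cited above.
```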
As data management and access becomes more critical
to business success, Unisphere Research finds that IT pro-
fessionals are embracing their expanded roles and relish
the opportunity to work with new technologies. Increas-
ingly, they want to be at the center of the action, and are
assuming roles associated with data science, but too often
they see themselves being forced into the job of firefighting rather than strategic, high-value tasks. The benefits of
ongoing staff training and use of cloud and database auto-
mation are some of the approaches cited in the report,
"The Vanishing Database Administrator," sponsored by
Ntirety, a division of HOSTING.
Indeed, the increasing size and complexity of data-
base environments is stretching IT resources thin, caus-
ing organizations to seek ways to automate routine tasks
to free up assets, such as by tapping into virtualization and
cloud. According to "The Empowered Database," a report
based on a survey of 338 IOUG members, and sponsored
by VMware and EMC, nearly one-third of organizations
are using or considering a public cloud service, and almost
half are currently using or considering a private cloud.
Still, we are just at the beginning of the changes to
come as a result of big data. In a recent Unisphere Research
Quick Poll, close to one-third of enterprises, or 30%,
report they have deployed the Apache Hadoop framework
in some capacity while another 26% said they planned
to adopt Hadoop within the next year. Strikingly, 91% of
respondents at Hadoop sites will be increasing their use
of Hadoop over the next 3 years, and one-third describe
expansion plans as significant. Key functions or applica-
tions supported by Hadoop projects include analytics and
business intelligence, working with IT operational data,and supporting special projects.
To help shed light on the expanding territory of big
data, DBTA presents the second annual Big Data Source-
book, a guide to the key enterprise and technology matters
IT professionals are grappling with as they take the jour-
ney to becoming data-driven enterprises. In addition to
articles penned by subject matter experts, leading vendors
also showcase their products and approaches to gaining
value from big data projects. Together, this combination
of articles and sponsored content provides insight into the
current big data issues and opportunities.
sponsored content

Operational Big Data
In fact, the operational database is a source of big data. Today, operational
databases must meet the challenges of
variety, velocity, and volume with millions
of users and billions of machines reading
and writing data via enterprise, mobile, and
web applications. The data is stored in an
operational database before it's stored in an Apache Hadoop distribution.
It's audits, clickstreams, customer
information, financial investments and
payments, inventory and parts, locations,
logs, messages, patient records, plays and
scores, sensor readings, scientific data, social
interactions, user and process status, user
and visitor profiles, and more.
It drives the eCommerce, energy,
entertainment, finance, gaming,
healthcare, insurance, retail, social media,
telecommunications industries, and more.
Today, operational databases must read
and write billions of values, maintain low
latency, and sustain high throughput to
meet the challenges of velocity and volume.
They must sustain millions of operations
per second, maintain sub-millisecond
latency, and store billions of documents
and terabytes of data. They must be able to
support the evolution of data in the form of
new attributes and new types.
The ability to meet these challenges is
necessary to support an agile enterprise.
By doing so, the agile enterprise extracts
actionable intelligence. However, time is
of the essence. When a new type of data
emerges, operational databases must store
it without delay. When the number of users
and machines increases, the operational
database must continue to provide data access
without performance degradation. When the
size of the data set increases, the operational
database must continue to store data.
These challenges are met by a)
supporting a flexible data model and b)
scaling out on commodity hardware. They
are met by NoSQL databases. They are met
by Couchbase Server. It's a scalable, high-performance document database engineered
for reliability and availability. By supporting a
document model via JSON, it can store new
attributes and new types of data without
modification, index the data, and enable
near-real time, lightweight analytics. By
implementing a shared-nothing architecture
with no single point of failure and consistent
hashing, it can scale with ease, on-demand,
and without affecting applications. By
integrating a managed object cache and
asynchronous persistence, it can maintain
sub-millisecond response times and sustain
high throughput. Couchbase Server was
engineered for operational big data and its
requirements.
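As a rough illustration of the flexible document model described here, the sketch below stores two JSON user profiles, the second of which carries attributes the first lacks, with no schema change. The connection string, bucket name, and document keys are placeholders, and the calls assume the 2.x Couchbase Python SDK's Bucket interface; this is a sketch, not an excerpt from Couchbase's documentation.

```python
from couchbase.bucket import Bucket  # assumes the 2.x Python SDK is installed

# Two documents for the same logical type; the second simply adds attributes.
# A JSON document store accepts both without any migration.
profile_v1 = {"type": "user_profile", "name": "Pat", "email": "pat@example.com"}
profile_v2 = {"type": "user_profile", "name": "Lee", "email": "lee@example.com",
              "devices": ["phone", "tablet"],         # new attribute
              "last_login": "2014-11-30T08:15:00Z"}   # new attribute

bucket = Bucket("couchbase://localhost/default")      # placeholder connection
bucket.upsert("user::pat", profile_v1)
bucket.upsert("user::lee", profile_v2)

print(bucket.get("user::lee").value["devices"])       # ['phone', 'tablet']
```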
While operational databases provide real-
time data access and lightweight analytics,
they must integrate with Apache Hadoop
distributions for predictive analytics,
machine learning, and more. While
operational data feeds big data analytics,
big data analytics feed operational data. The
result is continuous refinement. By analyzing the operational data, it can be updated to
improve operational efficiency. The result is
a big data feedback loop.
Couchbase provides and supports
a Couchbase Server plugin for Apache
Sqoop to stream data to and from Apache
Hadoop distributions. In fact, Cloudera
certified it for Cloudera Enterprise 5. In
addition, Couchbase provides and supports
a Couchbase Server plugin for Elasticsearch
to enable full text search over operational
big data.
Finally, operational databases must
meet the requirements of a global economy
in the information age. Today, users and
machines read and write data to enterprise,
mobile, and web applications from multiple countries and regions. To maintain data
locality, operational databases must support
deployment to multiple data centers. To
maintain the highest level of data locality,
operational databases must extend to mobile
phones / tablets and connected devices.
Couchbase Server supports both
unidirectional and bidirectional cross
data center replication. It enables the agile
enterprise to deploy an operational database
to multiple data centers in multiple regions
and in multiple countries. It moves the
operational database closer to users and
machines. In addition, Couchbase Server
can extend to mobile phones / tablets and
connected devices with Couchbase Mobile.
The platform includes Couchbase Lite, a
native document database for iOS, Android,
Java/Linux, and .NET, and Couchbase Sync
Gateway to synchronize data between
local databases and remote database servers.
The combination of cross data center
replication and mobile synchronization enables the agile enterprise to extend global
reach to individual users and machines. If
deployed to cloud infrastructure like Amazon
Web Services or Microsoft Azure, there is no
limit to how far Couchbase Server can scale
or how far the agile enterprise can reach.
COUCHBASE
www.couchbase.com
industry updates

The State of Big Data in 2014
How Businesses Are Driving Big Data Transformation
By John O'Brien
In 2014, we continued to watch how big data
is enabling all things big about data and its
business analytics capabilities. We also saw the emergence (and early acceptance) of Hadoop
Version 2 as a data operating platform, with
cornerstones of YARN (Yet Another Resource
Negotiator) and HDFS (Hadoop Distributed
File System). The ecosystem of Apache Foun-
dation projects has continued to mature at a
rapid pace, while vendor products continue
to join, mature, and benefit from Hadoop
improvements.
In last year's Big Data Sourcebook we
highlighted several items in "The State of
Big Data" article worth recapping. First, we
referenced the battle over persistence for
data architectures, primarily in enterprise adoption, that dealt with the promise of
the "everything in Hadoop" pundits and the "it's
OK to have another data platform" camp. In 2014,
we witnessed the acceptance of these multi-
tiered, specific workload capability architec-
tures that, at Radiant Advisors, we refer to
as the modern data platform. With gaining
acceptance, Hadoop is here to stay, and many
analysts refer to its role as "inevitable." This,
naturally, is tempered with its maturity, the
ability for enterprises to find and/or train
resources, and specifying the proper first use
case project and long term strategy, such as
the data lake or enterprise data hub strategies. We also discussed how companies needed
to understand how "data is data" when
approaching big data with "big eyes." For
the most part, in 2014 we saw mainstream
companies shift from a "the sky is falling if I
don't start a big data project" mindset to dis-
tinguishing big data projects as those for sit-
uations where the data wasn't typically rela-
tionally structured, or when it had volatile
schemas. "Schema on read" versus "schema
on write" benefits and situations became a
much better understood term in 2014, too.
And, more importantly, we have seen an
increasing understanding that all data can
be valuable and the need to explore data
for discovery and insights. Last year, we said that 2014 would be
the "race for access" hill as companies
demanded better access to data in Hadoop
by business analysts and power users and
that this access no longer be restricted to
programmers. As SQL reasserted itself as
the de-facto standard for common knowl-
edge users and existing data analysis and
integration tools, the SQL access capa-
bilities of Hadoop were under incredible
pressure to improve both in performance
and capability. Continued releases by Hor-
tonworks with Hive/Tez, Cloudera Impala,
and MapR's Drill initiative made orders
of magnitude performance improvements
for SQL access. The race was on: Actian's
Vortex made a splash at the Hadoop Sum-
mit in June, and others, such as IBM and
Pivotal, made significant improvements,
too. The race in 2014 continues going into
2015 with more SQL analytic capabilities
and performance improvements.
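Schema on read, mentioned above, simply means structure is imposed when the data is queried rather than when it is loaded; it is what these SQL-on-Hadoop engines rely on. A minimal Python sketch of the idea, with made-up event data and field names:

```python
import json

# Raw events land as-is, one JSON object per line; no table was defined up
# front (schema on write), so later events can carry attributes earlier ones lack.
raw_events = [
    '{"ts": "2014-10-01T12:00:00Z", "user": "a17", "action": "view"}',
    '{"ts": "2014-10-01T12:00:05Z", "user": "b42", "action": "buy", "sku": "X-9"}',
]

def read_with_schema(lines, fields):
    """Apply the desired schema at read time; attributes not present become None."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

for row in read_with_schema(raw_events, ["ts", "user", "action", "sku"]):
    print(row)
```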
Hadoop 2 Ushers in the Next Generation
The significance of Hadoop 2 has
recently started to resonate with com-
panies and enterprise architects. Mov-
ing away from its batch-oriented origins,
YARN has clearly positioned the data
operating system as two separate funda-
mental architecture components.
While the HDFS will continue to evolve
as the caretaker of data in the distributed
file system architecture with improved
name node high availability and perfor-
mance, YARN, introduced in Hadoop 2,
completely changes the paradigm of data
engines and access. Though the primary
role of YARN is still that of a resource nego-
tiator for the Hadoop cluster and focused on managing the resource needs of tens of
thousands of jobs in the cluster, it has also
now established a new framework.
The YARN framework serves as a plug-
gable layer of YARN-certified engines
designed to work the data in different
ways. Previously, MapReduce was the pri-
mary programming framework for devel-
opers to create applications that leveraged
the parallelism of the data nodes. As other
projects and data engines could work with
HDFS directly without MapReduce, a
centralized resource manager was needed
that would also enable innovation for new
data engines. MapReduce became its own
YARN engine for existing Hadoop 1 legacy
code, and Hive decoupled to work with
the new Tez engine. Long recognized as
ahead of the curve, Google caused quite a
furor when it announced that MapReduce
was dead and that they would no longer
develop in it. YARN was positioned for the
future of next-generation engines.
Sometimes in 2014 we felt that the
booming big data drum was starting to die down. And, sometimes we wondered
if it only seemed that way because every-
one was chanting "Storm" just a bit louder.
Another major driver in the Hadoop
implementations was that big data didn't
mean "fast data." The industry wanted
both big and fast: The Spark environment
is where both early adopters were writing
new applications, and the development
community was quickly developing Spark
to be a top-level project to meet those
needs. The Spark community touts itself
as "lightning-fast cluster computing," pri-
marily leveraging in-memory capabilities
of the data nodes, but also a newer, faster
framework than MapReduce on disk. While Spark was in its infancy in 2013,
we saw this need for big data speed being
tackled by two-tier distributed in-memory
architectures. Today, Spark is a framework
for Spark SQL, Spark Streaming, Machine
Learning, and GraphX running on
Hadoop 2s YARN architecture. In 2014,
this has been very exciting for the industry,
but many of the mainstream adopters are
patiently waiting for the early adopters to
do their magic.
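As a sketch of why that in-memory model matters for exploratory work, the RDD-style example below caches a filtered data set so that repeated queries avoid rereading from disk. The file path and log format are hypothetical; the calls use the Spark 1.x Python API.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "log-exploration")

# Load raw logs (path is a placeholder) and keep only error lines in memory.
lines = sc.textFile("hdfs:///data/clickstream/2014/part-*.log")
errors = lines.filter(lambda line: "ERROR" in line).cache()

print(errors.count())                      # first action materializes the cache
codes = errors.map(lambda line: (line.split()[0], 1)) \
              .reduceByKey(lambda a, b: a + b)
print(codes.take(10))                      # subsequent work is served from memory

sc.stop()
```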
Two Camps: Early Adopters and Mainstream Adopters
For years, overwhelming data volumes,
complexity, or data science endeavors were
the primary drivers behind early big data
adopters. Many of these early adopters
were in internet-related industries, such
as search, e-commerce, social networking,
or mobile applications that were dealing
with the explosion of internet usage and
adoption.
In 2014, we saw mainstream adopters
become the next wave of big data implementations that are expected to be multi-
ple times larger than the early adopters. We
define mainstream adopters as those busi-
nesses that seek to modernize their data
platforms and analytics capabilities for
competitive opportunities and to remain
relevant in a fast changing world, but are
tempered with some time to research, ana-
lyze, and adopt while maintaining current
business operations. Mainstream adopt-
ers have had pilots and proof of concepts
for the past year or two with one or two
Hadoop distributors and now are decid-
ing how this also fits within their overall
enterprise data strategy. Leading the way for mainstream adopt-
ers is, by consequence, meeting enterprise
and IT requirements for data management,
security, data governance, and compliance
in a new, more complicated, set of data
that includes public social data, private
customer data, third-party data enrich-
ment, and storage in cloud and on-prem-
ises. Over the past year, it has often felt like
the fast-driving big data vehicle hit some
pretty thick mud to plow through, and
some in the industry argued that forcing Hadoop to meet the requirements of
enterprise data management was missing
the point of big data and data science. For
now, we have seen most companies agree
that risk and compliance are things that
they must take seriously moving forward.
Mainstream Adopters Redefining Commodity Hardware
As mainstream adopters worked
through data management and governance
hurdles for enterprise IT, next up was the
startling exclamation: "I thought you said
that was cheap commodity hardware?!"
This has become an interesting reminder
of the roots of big data and the difference
with IT enterprise-class hardware.
The explanation goes like this. Early
developers and adopters were driven to
solve truly big data challenges. In the sim-
plest of terms, big data meant big hardware
costs and, in order to solve that economic
challenge, big data needed to run on the
lowest cost commodity hardware and
software that was designed to be fault-tolerant to cope with high failure rates with-
out disrupting service. This is the purpose
of HDFS, though HDFS does not differen-
tiate how a data node is configured and
this is where IT's standard order list differs.
Enterprise infrastructure organiza-
tions have been maintaining the data cen-
ter needs of companies for years and have
efficiently standardized orders with chosen
vendors. In this definition of commodity
servers, it's more about industry standards
in parts, and no proprietary hardware
could limit the use of these servers as data
nodes (or any other server needs in the data
center). While big data implementations with hundreds to thousands of servers per
cluster strive for the lowest cost white box
servers from less recognized industry ven-
dors with the lowest cost components, their
commodity servers can be as low as $2,000
per server. Similar servers from industry
recognized big names with their own com-
ponents or industry best of breed com-
ponents touting stringent integration and
quality testing have averaged $25,000 per
server in several recent Hadoop implemen-
tations that we have been involved with. We
have started to coin these servers as "com-
modity-plus" for mainstream companies
operationalizing Hadoop clusters, and
they don't seem to mind.
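A back-of-the-envelope comparison using the per-server prices cited above makes the trade-off concrete; the node counts below are illustrative examples, not figures from the article.

```python
# Per-server prices come from the article; cluster sizes are hypothetical.
clusters = {
    "white box (100 nodes)":     {"nodes": 100, "price_per_node": 2_000},
    "commodity-plus (16 nodes)": {"nodes": 16,  "price_per_node": 25_000},
}

for label, cfg in clusters.items():
    total = cfg["nodes"] * cfg["price_per_node"]
    print(f"{label}: ${total:,} total hardware")
# white box (100 nodes): $200,000 total hardware
# commodity-plus (16 nodes): $400,000 total hardware
```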
Another discussion that continues
from the early adopters is how a data
node should be configured. Some imple-
mentations concerned with truly big data
configure data nodes with 25 front-load-
ing bays and multi-terabyte slower SATA
drives for the highest capacity within
their cluster. Other implementations are
more concerned with performance and
opt for faster SAS drives at lower capaci-
ties but balanced with more servers in the
cluster for further increased performance from parallelism. Some hyper-perfor-
mance-oriented clusters will even opt for
faster SSD drives in the cluster. This also
leads to discussions regarding multi-core
CPUs and how much memory should
be in a data node. And, there have been
equations for the number of cores related
to the amount of memory and number of
drives for optimal performance of a data
node. We have seen that enterprise infra-
structure has leaned more toward fewer
nodes in a production cluster (8 to 32 data
nodes) rather than 100-plus nodes. Their
reasoning is twofold: More powerful data
nodes are actually more interchangeable, with data centers also converging data
virtualization and private cloud strate-
gies. Second, ordering more of the pow-
erful servers can yield increased volume
discounts and maintain standardization
of IT servers in the data center.
The Data Lake Gains Traction
In 2014, we saw more acceptance of
the term "data lake" as an enterprise data
architecture concept pushed by Horton-
works and its "modern data architecture" approach. The "enterprise data hub" is a
similar concept promoted by Cloudera
and also has some of the industry mind-
share. Informally, we saw the data lake term
used most often by companies seeking to
understand an approach to enterprise data
strategy and roadmaps. However, we also
saw backlash from industry pundits that
called the data lake a "fallacy" or "murky."
Terms such as "data swamp" and "data
dump" were also thrown around to describe how
things could go wrong without a good
strategy and governance in place. Like the
term big data, the data lake has started
out as a high-level concept to drive further
definition and patterns going forward.
Throughout 2014, we worked with
companies ready to define a clear, detailed
strategy based on the data lake concept for
enterprise data strategy. While this is pro-
found, it is very achievable with data man-
agement principles that require answers to
new questions regarding a new approach
to data architecture. Some issues are sim-
ple and more technical, such as keeping online archives of historical data ware-
house data still easily accessible by users
with revised service-level agreements.
Some issues are more fundamental, such as
the data lake serving as a single repository of
all data, including being a staging area for
the enterprise data warehouse (with lower-
cost historical persistence for other uses, as
data scientists are more interested in raw,
unaltered data). Other concerns are a bit
more complex, such as persisting customer
or other privacy-compliant data in the data
lake for analysis purposes. Data governance
is concerned with who has access to priva-
cy-controlled data and how it is used. Data
management questioned the duplication
of enterprise data and consistency.
These are hard data management and
governance decisions for enterprises to
make, but they are making them, and
acknowledging that patience and adapt-
ability are key for the coming years as
data technologies continue to evolve and
change the landscape. The data lake will
continue to prove itself and make a fun-
damental shift in enterprise architecture
in the coming years. When you take a step back and watch the business and IT driv-
ers, momentum, and technology develop-
ment, you can see how the data lake will
become an epicenter in enterprise data
architecture. If you take two steps back,
you will see how 2015 developments could
begin the evolution that transforms the
data lake into a data operating system for
the enterprise, evolving beyond business
intelligence and analytics into operational
applications and further realization of ser-
vice-oriented architectures.
What's Ahead
In 2015, the mainstream adoption of
enterprise data strategies and acceptance
of the data lake will continue as data man-
agement and governance practices provide
further clarity. The cautionary tale of 2014,
to ensure business outcomes drive big data
adoption rather than the hype of previ-
ous years, will likewise continue. Hadoop
is clearly here to stay and "inevitable,"
and will have its well-deserved seat at the
enterprise data table, along with other data technologies. Hadoop won't be
taking over the world any time soon, and
principle-based frameworks (such as our
own modern data platform) recognize the
evolution of both data technologies and
computing price/performance on mod-
ern data architecture. Besides the usual
maturing and improvements overall and
for existing big data tools, we predict some
major achievements in big data for 2015
that we're keeping an eye on.
The Apache Spark engine will con-
tinue to mature, improve, and gain accep-
tance in 2015. With this adoption and the
incredible capabilities that it delivers, we could start to see applications and capabil-
ities beyond our imagination. Keep an eye
out for these early case studies as inspiration
for your own needs.
With deepening acceptance and recog-
nition of YARN as the standard for operat-
ing Hadoop clusters, open-source projects
and existing vendors will port their prod-
ucts to YARN certification and integration.
This will not only close the gap between
existing data technologies and
Hadoop clusters, but more exciting will be to see data technologies port over to YARN
so that they can operate and improve their
own capabilities within Hadoop. New
engines and existing engines running on
YARN in 2015 will further influence and
drive the adoption of Hadoop in enter-
prise data architecture.
In 2014, we saw mainstream compa-
nies requiring data management features
such as security and access control. These
first steps will be critical to keep an eye on
during 2015 for your own company's data
management requirements. Our concern
here is that the sexy high-performance world of Spark and improved SQL capa-
bilities will get the majority of attention,
while the less sexy side of security and gov-
ernance will not mature at the same rate.
There is significant pressure to do so with
the mountain of mainstream adopters
waiting, so we'll keep an eye on this one.
Finally, our most exciting item to watch
in 2015 will be Hadoop's subtle transfor-
mation as business drivers move it beyond
a primary write-once/read-many repu-
tation to that of full create/read/update/
delete (CRUD) operational capability at
big data scale. The benefits of the Hadoop
architecture with YARN and HDFS go well beyond big data analytics, and enter-
prise data architects can start thinking
about what a YARN data operating system
can do with operational systems. In a few
years, this could also redefine the data lake
or we'll simply create another label for
the industry to debate. Once big data, high
performance, and CRUD requirements are
met within Hadoop, enterprise architects
will start thinking about the economies
of scale and efficiency gained from this
next-generation architecture.
John O'Brien is princi-
pal and CEO of Radiant
Advisors. With more than
25 years of experience
delivering value through
data warehousing and
business intelligence pro-
grams, O'Brien's unique perspective
comes from the combination of his roles
as a practitioner, consultant, and vendor
CTO in the BI industry. As a globally rec-
ognized business intelligence thought
leader, O'Brien has been publishing arti-
cles and presenting at conferences in
North America and Europe for the past 10
years. His knowledge in designing, build-
ing, and growing enterprise BI systems
and teams brings real-world insights to
each role and phase within a BI pro-
gram. Today, through Radiant Advisors,
O'Brien provides research, strategic advi-
sory services, and mentoring that guide companies in meeting the demands of
next-generation information management,
architecture, and emerging technologies.
In Q1 2014, Radiant Advisors released its
"Independent Benchmark: SQL on Hadoop
Performance," which captured the current
state of options and widely varying perfor-
mance. Radiant Advisors plans to release
the next benchmark 1 year later in Q1 2015
to quantify those efforts.
sponsored content

Big Data for Tomorrow
The world of enterprise solutions
has changed. It has become distributed and
real-time work. A famous NY Times writer,
Thomas Friedman, summarizes it succinctly:
"The World Is Flat." In addition to this
technological advancement, the compute
and online world is demanding real-time
answers to questions. These ever-growing
and disparate data sources need to be efficiently connected to enable new discovery
and more insightful answers.
To maintain competitive advantage in
this new landscape, organizations must be
prepared to weed out the hype and focus
on proven ways to future-proof existing
systems while efficiently integrating with
new technologies to provide the required
value of real-time insight to users and
decision-makers. Companies need to focus
on the following key requirements for new
technologies to take advantage of data and
find unique business value and new revenues.
DISTRIBUTED
The world is moving towards distributed
architectures. Memory is becoming a
commodity; the Internet is easily accessible
and fairly inexpensive, and with more sources
of data creating an increase in information, it
is easy to understand how organizations will
require multiple, distributed data centers to
store it all.
With distributed architectures comes a
need for distributed features such as parallel ingest, or the ability to quickly obtain data
using multiple resources/locations to enable
real-time application access to information
that is being processed. Then there is a
need for distributed task processing, which
helps to move the processes closer to the
locations where data is stored, thus saving
time and improving query performance as a
side effect. Finally, there becomes a need for
distributed query as well. This is the ability
to perform a search of data across different
locations, quickly in order to find hidden
value within the data for improved business
decision support.
SCALABLE
The next requirement revolves around
ease of scalability. When working with
distributed architecture, it is inevitable that
companies will need to eventually scale out their applications across multiple locations
in order to keep up with growing data
demands. Technology that is easily scalable/
adaptable is very important in long-term
success and helps with managing ROI.
FLEXIBLE
Another requirement, due to the many
different types of data being collected, is the
ability to handle multiple data types. If a
technology is too limited in the way it needs
to collect information from structured,
unstructured, semi-structured sources,
organizations will find it difficult to grow
their solution long-term due to concerns
with data type limitations. On the other
hand, a technology that is able to natively or
alternatively store and access many types of
information from multiple data sources will
be key to enabling long-term competitive
advantage and growth.
COMPLEMENTARY
And finally, there is a need to address
existing and legacy solutions already implemented at a large scale. Most
enterprises will not be tearing out widely
implemented solutions spanning across
their organization. It is important to require
that any new technologies being assessed
have the ability to complement existing
legacy solutions as well as any potential new
technologies that may add benefit to the
business, its customers and solution/services.
Today's enterprise success depends on the
ability to obtain key information quickly and
accurately and then apply that knowledge
to your business to make more reliable
decisions. Utilizing technology that is able
to offer the peace of mind to be successful
through distributed, scalable, flexible and
complementary features is priceless.
For over a quarter century, Objectivity,
Inc.'s embedded database software has
helped discover and unlock the hidden value in Big Data for improved real-
time intelligence and decision support.
Objectivity focuses on storing, managing
and searching the connection details
between data. Its leading edge technologies,
InfiniteGraph, a unique distributed, scalable
graph database, and Objectivity/DB, a
distributed and scalable object management
database, enable unique search and
navigation capabilities across distributed
datasets to uncover hidden, valuable
relationships within new and existing data
for enhanced analytics and facilitate custom
distributed data management solutions for
some of the most complex and mission-
critical systems in operation around the
world today.
By working with a well-established
technology provider with long-term, proven
Big Data implementations, enterprise
companies can feel confident that the future
requirements of their organizations will be
met along with the ability to take advantage
of new technological advances to keep ahead
of the market. For more information on how to get
started with evaluating technologies for your
business, contact Objectivity, Inc. to inquire
about our complimentary 2-hour solution
review with a senior technical consultant.
Visit our website at www.objectivity.com for
more information.
OBJECTIVITY, INC.
www.objectivity.com
industry updates

The State of Big Data Management
The Enabling Force Behind Digital Enterprises
By Joe McKendrick
For years, data management was part of
a clear and well-defined mission in organiza-
tions. Data was generated from transaction
systems, then managed, stored, and secured
within relational database management sys-
tems, with reports built and delivered to busi-
ness decision makers' specs.
This rock-solid foundation of skills,
technologies, and priorities served enter-
prises well over the years. But lately, this
arrangement has been changing dramati-
cally. Driven by insatiable demand for IT
services and data insights, as well as theproliferation of new data sources and for-
mats, many organizations are embracing
new technology and methods such as cloud,
database as a service (DBaaS), and big data.
And, increasingly, mobile isn't part of a ven-
dor's pitch sheet, or futuristic overview at a
conference presentation. It's part of today's
reality, a part of everyday business. Many
organizations are already providing faster
delivery of applications, differentiated prod-
ucts and services, and some are building
new customer experiences through social,
mobile, analytics, and cloud.
Over the coming year, 2015, we will
likely see the acceleration of the following
dramatic shifts in data management:
1. More Automation to Manage
the Squeeze
There is a lot of demand coming from the
user side, but data management profession-
als often find themselves in a squeeze. Busi-
ness demand for database services as well as
associated data volumes is growing at a rate of 20% a year on average, a survey by Uni-
sphere Research finds. In contrast, most IT
organizations are experiencing flat or shrink-
ing budgets. Other factors such as substantial
testing requirements and outdated manage-
ment techniques are all contributing to a cost
escalation and slow IT response.
Database professionals report that they
spend more time managing database lifecy-
cles than anything else. A majority still over-
whelmingly perform a range of tasks manu-
ally, from patching databases to performing
upgrades. Compliance remains important
and requires attention. As databases move
into virtualized and cloud environments,
there will be a need for more comprehen-
sive enterprise-wide testing. Another recent
Unisphere Research study finds that for more
than 50% of organizations, it takes their IT
department 30 days or more to respond to
new initiatives or deploy new solutions. For
a quarter of organizations, it takes 90 days
or more. In addition, more than two-thirds
of organizations indicate that the number of databases they manage is expanding. The
most pressing challenges they are facing as
a result of this expansion are licensing costs,
additional hardware and network costs, addi-
tional administration costs, and complexity.
("The Empowered Database: 2014 Enterprise
Platform Decisions Survey," September 2014).
As data professionals find their time
and resources squeezed between managing
increasingly large and diverse data stores,
increased user demands, and restrictive
budgets, there will be greater efforts
to automate data management tasks.
Expect a big push to automation in the
year ahead.
2. Big Data Becomes Part of Normal Day-to-Day Business
Relational data coming out of transac-
tional systems is now only part of the enterprise equation, and will share the stage to
a greater degree with data that previously
could not be cost-effectively captured, man-
aged, analyzed, and stored. This includes
data coming in from sensors, applications,
social media, and mobile devices.
With increased implementations
of tools and platforms to manage this
dataincluding NoSQL databases and
Hadooporganizations will be better
equipped to prepare this data for con-
sumption by analytic software. A recent
survey of Database Trends and Applications
readers finds 26% now running Hadoop
within their enterprises, up from 12% 3
years ago. A majority, 63%, also now oper-
ate NoSQL databases at their locations
("DBTA Quick Poll: New Database Tech-
nologies," April 2014).
3. Cloud Opens Up Database as a Service
More and more, data managers and
professionals will be working with cloud-
based solutions and data, whether associated with a public cloud service, or an
in-house database-as-a-service (DBaaS)
solution. This presents many new oppor-
tunities to provide new capabilities to
organizations, as well as new challenges.
Moving to cloud means new program-
ming and data modeling approaches will
be needed. Integration between on-prem-
ises and off-premises data also will be
intensifying. Data security will be a front-
burner issue.
Recent Unisphere Research surveys
find that close to two-fifths of enterprises
either already have or are considering run-
ning database functions within a private
cloud, and about one-third are currently
using or considering a public cloud ser-
vice. For more than 25% of organizations,
usage of private-cloud services increased
over the past year. Cloud and virtualization are being
seamlessly absorbed into the jobs of most
database administrators, and in some
cases, reducing traditional activity while
expanding their roles. Database as a ser-
vice (DBaaS), or running databases and
managing data within an enterprise pri-
vate cloud setting, offers data managers
and executives a means to employ shared
services to manage their fast-growing
environments. The potential advantage
of DBaaS is that database managers need
not re-create processes or environments
from scratch, as these resources can be
pre-packaged based on corporate or com-
pliance standards and made readily avail-
able within the enterprise cloud. Close to
half of enterprises say they would like
to see capacity planning services offered
through private clouds, while 40% look
for shared database resources. A sim-
ilar number would value cloud-based
services providing automated database
provisioning.
4. Virtualization and Software-Defined Data Centers on the Way
Until recently, mentioning the term
platform brought images of Windows,
mainframe, and Linux servers to mind.
However, for most enterprises, platform
has become irrelevant. This extends to
the database sphere as well; many of the
functions associated with specific data-
bases can be abstracted away from under-
lying hardware and software.
The use of virtualization is helping
to alleviate strains being created by the
increasing size and complexity of data-
base environments. The use of virtual-
ization within database environments is
increasing. Almost two-thirds of organi-
zations in a recent Unisphere Research
survey say there have been increases over
the past year. Nearly half report that more than 50% of their IT infrastructure is
virtualized. The most common benefits
organizations report as a result of using
virtualization within their database envi-
ronments are reduced costs, consolidation,
and standardization of their infrastructure
("The Empowered Database: 2014 Enter-
prise Platform Decisions Survey," Septem-
ber 2014).
Another emerging trend, software-
defined data centers, software-defined
storage, and software-defined network-
ing, promises to take this abstraction to a
new level. Within a software-defined envi-
ronment, services associated with data
centers and database services (storage,
data management, and provisioning) are
abstracted into a virtual service layer. This
means managing, configuring, and scaling
data environments to meet new needs will
increasingly be accomplished from a sin-
gle control panel. It may take some time
to reach this stage, as many of the compo-
nents of software-defined environments
are just starting to fall into place. Expect to see significant movement in this direc-
tion in 2015.
5. Data Managers and Professionals Will Lead the Drive to Secure Corporate Data
One need only look at recent headlines
to understand the importance of data
security: major enterprises have suf-
fered data breaches over the past year, and
in some cases, have taken CIOs and top
executives down with them. The rise of
big data and cloud, with their more com-
plex integration requirements, accessibil-
ity, and device variety, has increased the need for greater attention to data security
and data governance issues.
Data security has evolved into a
top business challenge, as villains take
advantage of lax preventive and detective
measures. In many ways, it has become an
enterprise-wide issue in search of leader-
ship. Senior executives are only too pain-
fully aware of what's at stake for their
businesses, but often don't know how to
approach the challenge. This is an oppor-
tunity for database administrators and security professionals to work together,
take a leadership role, and move the enter-
prise to take action.
Over the coming year, database manag-
ers and professionals will be called upon
to be more proactive and lead their com-
panies to successfully ensure data privacy,
protect against insider threats, and address
regulatory compliance. An annual survey
by Unisphere Research for the Indepen-
dent Oracle Users Group (IOUG) finds
there is more awareness than ever of the
critical need to lock down data environ-
ments, but also organizational hurdles in
building awareness and budgetary sup-
port for enterprise data security ("DBA
Security Superhero: 2014 IOUG Enterprise
Data Security Survey," October 2014).
6. Mobile Becomes an Equal Client
Mobile computing is on the rise, and
increasingly mobile devices will be the cli-
ent of choice with enterprises in the year
ahead. This means creating ways to access
and work with data over mobile devices. More analytics, for example, is being
supported within mobile apps. Some of
the leading BI and analytics solutions ven-
dors now offer mobile apps that offer dash-
boards, often configurable, that provide
insight and visibility into operational
trends to decision makers who are outside
of their offices. While industry watchers
have been predicting the democratiza-
tion of data analytics across enterprises
for years, the arrival of mobile apps as front
end clients to BI and analytics systems may
be the ultimate gateway to easy-to-use
analytics across the enterprise. By their
very nature, mobile apps need to be designed to be as simple and easy to use
as possible. Over the coming year, mobile
app access to key data-driven applications
will become part of every enterprise.
The ability to access data from any and
all devices, of course, will increase secu-
rity concerns. While many enterprises
have tacitly approved the "bring your own
device" (BYOD) trend in recent years,
some are looking to move to corporate-
issued devices that will help lock down
sensitive data. The coming year will see increased efforts to better ensure the secu-
rity of data being sent to mobile devices.
7. Storage Enters the Limelight
Storage has always been an unappreci-
ated field of endeavor. It has been almost
an afterthought, seen in disk drives and
disk arrays running somewhere in the
back of data centers. This is changing rap-
idly, as enterprises recognize that storage is
shaping their infrastructures' capabilities.
There's no question that many organiza-
tions are dealing with rapidly expand-
ing data stores. Much of today's data
growth, coming out of enterprise appli-
cations, is being exacerbated by greater
volumes of unstructured, social media and
machine-generated data making their way
into the business analytics platform. Many
enterprises are also evolving their data
assets into data lakes, in which enter-
prise data is stored up front in its raw form
and accessed when needed, versus being
loaded into purpose-built, siloed data
environments. The question becomes, then, where
and how to store all this data. The storage
approach that has worked well for orga-
nizations over the decades (produce data
within a transaction system, then send it
downstream to a disk, and ultimately, a
tape system) is being overwhelmed by
today's data demands. Not only is the
amount of data rapidly growing, but
more users are demanding greater and
more immediate access to data, even
when it may be several weeks, months, or
years old.
Over the coming year, there will be a
push by enterprises to manage storage smartly, versus simply adding more
disk capacity to existing systems or pur-
chasing new systems from year to year. A
recent survey by Unisphere Research finds
growing impetus toward smarter storage
solutions, which include increased stor-
age efficiency through data compression,
information lifecycle management and
consolidation, or deployment strategies
such as tiered storage. At the same time,
storage expenditures keep rising, eating a
significant share of IT budgets and impeding other IT initiatives. For those with
significant storage issues, the share stor-
age takes out of IT budgets is even greater
("Managing Exploding Data Growth in
the Enterprise: 2014 IOUG Database Stor-
age Survey," May 2014).
What's Ahead
The year 2015 represents new oppor-
tunities to expand and enlighten data
management practices and platforms to
meet the needs of the ever-expanding
digital enterprise. To be successful, dig-
ital business efforts need to have solid
data management practices underneath.
As enterprises go digital, they will be rely-
ing on well-managed and diverse data to
explore and reach new markets.
Joe McKendrick is an
author and independent
researcher covering innovation, information tech-
nology trends, and markets.
Much of his research work
is in conjunction with Uni-
sphere Research, a division of Information
Today, Inc. (ITI), for user groups including
SHARE, the Oracle Applications Users Group,
the Independent Oracle Users Group, and the
International DB2 Users Group. He is also
a regular contributor to Database Trends
and Applications, published by ITI.
industry updates
The State of Data Integration
Data Integration Evolves to Support a Bigger Analytic Vision
By Stephen Swoyer
What has always made data a
hard problem is precisely the issue of access-
ing, preparing, and producing it for machine
and, ultimately, for human consumption.
What makes this a much harder problem in
the age of big data is that the information
we're consuming is vectored to us from so
many different directions.
The data integration (DI) status quo is
predicated on a model of data-at-rest. The
designated final destination for data-at-rest is
(and, at least for the foreseeable future, will
remain) the data warehouse (DW). Tradi-
tionally, data of a certain type was vectored
to the DW from more or less predictable
directions (viz., OLTP systems or flat files)
and at the more or less predictable velocities
circumscribed by the limitations of the batch
model. Thanks to big data, this is no longer the case. Granted, the term "big data" is
empty, hyperbolic, and insufficient; granted,
there's at least as much big data hype as big
data substance. But still, as a phenomenon,
big data at once describes 1) the technolog-
ical capacity to ingest, store, manage, syn-
thesize, and make use of information to an
unprecedented degree and 2) the cultural
capacity to imaginatively conceive of and
meaningfully interact with information in
fundamentally different ways. One conse-
quence of this has been the emergence of a
new DI model that doesn't so much aim to
supplant as to enrich the status quo ante. In
addition to data-at-rest, the new DI model is
able to accommodate data-in-motion, i.e.,
data as it streams and data as it pulses: from
the logs or events generated by sensors or
other periodic signalers to the signatures or
anomalies that are concomitant with aperi-
odic events such as fraud, impending failure,
or service disruption.
Needless to say, comparatively little of
this information is vectoring in from con-
ventional OLTP systems. And that, as poet
Robert Frost might put it, makes all the
difference.
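To make the contrast concrete, here is a small Python sketch of the two models; the event format, window size, and threshold are assumptions for illustration. Data-at-rest is queried after it lands, while data-in-motion is evaluated as each event arrives, so that aperiodic signatures such as an error spike surface immediately.

```python
from collections import deque

def at_rest_report(events):
    """Data-at-rest: a batch pass over a landed data set (e.g., a nightly report)."""
    failures = sum(1 for e in events if e["status"] == "fail")
    return failures, len(events)

def in_motion_monitor(stream, window=100, threshold=0.2):
    """Data-in-motion: score each event as it arrives and flag anomalous windows."""
    recent = deque(maxlen=window)
    for event in stream:
        recent.append(1 if event["status"] == "fail" else 0)
        if len(recent) == window and sum(recent) / window > threshold:
            yield event  # e.g., the signature of impending failure or fraud

# events = read_from_sensor_feed()   # hypothetical source of periodic signals
```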
Beyond Description
We're used to thinking of data in terms of the predicates we attach to it. Now as ever,
we want and need to access, integrate, and
deliver data from traditional structured
sources such as OLTP DBMSs, or flat and/
or CSV files. Increasingly, however, we're
alert to, or we're intrigued by, the value of the
information that we believe to be locked into
multi-structured or so-called "unstruc-
tured" data, too. (Examples of the former
include log files and event messages; the lat-
ter is usually used as a kitchen-sink category
to encompass virtually any data-type.) Even
if we put aside the philosophical problem
of structure as such (semantics is structure;
schema is structure; a file-type is structure),
we're confronted with the fact that data inte-
gration practices and methods must and
will differ for each of these different types.
The kinds of operations and transforma-
tions we use to prepare and restructure the
normalized data we extract from OLTP sys-
tems for business intelligence (BI) reporting
and analysis will prove to be insufficient (or
quite simply inapposite) when brought to
bear against these different types of data.
The problem of accessing, preparing, and
delivering unconventional types of data from
unconventional types of sources, as well as
of making this data available to a new class
of unconventional consumers, requires new methods and practices, to say nothing of new
(or at least complementary) tools.
This has everything to do with what might
be called a much bigger analytic vision.
Inspired by the promise of exploiting data
mining, predictive analytics, machine learn-
ing, or other types of advanced analytics on a
massive scale, the focus of DI is shifting from
that of a static, deterministic discipline, in
which a kind of two-dimensional world is
represented in a finite number of well-defined
sponsored content
Traditionally, organizations have relied upon a single data warehouse to serve as the center of their data universe. This data warehouse approach operated on a paradigm in which the data revealed a single, unified version of the truth. But today, both the amount and types of data available have increased dramatically. With the advent of Big Data, companies now have access to more business-relevant information than ever before, resulting in many data repositories to store and analyze it.
THE CHALLENGES OF MOVING BIG DATA
However, to use Big Data, you must
be able to move it, and the challenges of
moving Big Data are multi-faceted. Out of
the gate, the pipes between data repositories
remain the same size, while the data grows
at an exponential rate. The issue worsens
when traditional tools are used to attempt to
access, process, and integrate this data with other systems. Yet, companies cannot rely on
traditional data warehouses alone.
Thus, companies are increasingly turning to Apache Hadoop, the free, open source, scalable software for distributed computing that handles both structured and unstructured data. The movement towards Hadoop is indicative of something bigger: a new paradigm that's taking over the business world, that of the modern data architecture and the data supply chain that feeds it. The data supply chain describes a new reality in which businesses find themselves coordinating multiple data sources rather than using a single data warehouse. The data from these sources, which often varies in content, structure, and type, has to be integrated with data from other departments and other target systems within an enterprise. Big Data is rarely used en masse. Instead, different types of data tell different stories, and companies need to be able to integrate all of these narratives to inform business decisions.
HADOOP'S ROLE IN THE DATA SUPPLY CHAIN
In this new world, companies must constantly move data from one place to another to ensure efficiency and lower costs. Hadoop plays a significant role in the data supply chain. However, it's not an end-all solution. The standard Hadoop toolsets lack several critical capabilities, including the ability to move data between Hadoop and relational databases. The technologies that exist for data movement across Hadoop are cumbersome. Companies need solutions that make data movement to and from Hadoop easier, faster, and more cost effective.

While open source tools like Sqoop are designed to deal with large amounts of data, they are often not enough by themselves. These tools can be difficult to use, require specialized skills and time to implement, typically focus only on certain types of data, and cannot support incremental changes or real-time feeds.
EFFECTIVELY MOVING BIG DATA INTO AND OUT OF HADOOP
The most effective answer to this challenge is to implement solutions that are specifically designed to ease and accelerate the process of data movement across a broad number of platforms. These technologies allow IT organizations to easily move data from one repository to another in a highly visible manner. The software should also unify and integrate data from all platforms within an enterprise, not just Hadoop. And it should include change data capture (CDC) technology to keep the target data up to date in a way that's sensitive to network bandwidth.
Attunity offers a solution for companies looking to turbocharge the flows across their data supply chain while fully supporting a modern data architecture. Attunity Replicate features a user-friendly GUI, with a Click-to-Replicate design and drag-and-drop functionality to move data between repositories. Attunity supports Hadoop as a source and as a target, as well as every major commercial database platform and data warehouse available. It is scalable and manageable and can be used to move data to and from the cloud when combined with Attunity CloudBeam.
MAKING BIG DATA & HADOOP WORK FOR YOU!

Attunity enables companies to improve their data flows to capitalize on all their data, including Big Data sources. Their solutions limit the investment a company needs to make by reducing the hardware and software needed for managing and moving data across multiple platforms out of the box. Additionally, Attunity solutions are high performance and provide an easy-to-use graphical interface that helps companies make timely and fully informed decisions. Using high-performance data movement software like Attunity's, companies can unleash not only the full power of Hadoop but also the power of all their other technologies to enable real-time analytics and true competitive advantage.
To learn more, download this Attunity whitepaper: Hadoop and the Modern Data Supply Chain
http://bit.ly/HadoopWP
ATTUNITY
www.attunity.com
Unleashing the Value of Big Data & Hadoop
The State of Data Integration
dimensions) to a polygonal or probabilistic discipline with a much greater number of dimensions. The static stuff will still matter and will continue to power the great bulk of day-to-day decision making, but this will in turn be enriched, episodically, with different types of data. The challenge for DI is to accommodate and promote this enrichment, even as budgets hold steady (or are adjusted only marginally) and resources remain constrained.
Automatic for the People
What does this mean for data integration? For one thing, the day-to-day work of traditional DI will, over time, be simplified, if not actually automated. This work includes activities such as 1) the exploration, identification, and mapping of sources; 2) the creation and maintenance of metadata and documentation; 3) the automation or acceleration, insofar as feasible, of testing and quality assurance; and, crucially, 4) the deployment of new OLTP systems and data warehouses, as well as of BI and analytic applications or artifacts. These activities can and will be accelerated; in some cases (as with the generation and maintenance of metadata or documentation) they will, for practical, day-to-day purposes, be more or less completely automated.
This is in part a function of the maturity of the available tooling. Most DI and RDBMS vendors ship platform-specific automation features (pre-fab source connectivity and transformation wizards; data model design, generation, and conversion tools; SQL, script, and even procedural code generators; scheduling facilities; in some cases even automated dev-testing routines) with their respective tools. Similarly, a passel of smaller, self-styled data warehouse automation vendors market platform-independent tools that purport to automate most of the same kinds of activities, and which are also optimized for multiple target platforms. On top of this, data virtualization (DV) and on-premises-to-cloud integration specialists can bring intriguing technologies to bear, too. Most DI vendors offer DV (or data federation) capabilities of some kind; others market DV-only products. None of these tools is in any sense a silver bullet: custom-fitting and design of some kind is still required and, frankly, always will be required. The catch, of course, is that even though such tools can likewise help to accelerate key aspects of the day-to-day work of building, managing, optimizing, maintaining, or upgrading OLTP and BI/decision support systems, they can't and won't replace human creativity and ingenuity. The important thing is that they give us the capacity to substantively accelerate much of the heavy lifting of data integration work.
Big Data Integration: Still a Relatively New Frontier
This just isn't the case in the big data world. As Douglas Adams, author of The Hitchhiker's Guide to the Galaxy, might put it, traditional data integration tools or services are mature and robust in exactly the way that big data DI tools aren't.

At this point, guided and/or self-service features (to say nothing of management-automation amenities) are still mostly missing from the big data offerings. As a result, organizations will need more developers and more technologists to do more hands-on work when they're doing data integration in conjunction with big data platforms.
Industry luminary Richard Winter tackled this issue in a report entitled "The Real Cost of Big Data," which highlights the cost disparity between using Hadoop as a landing area and/or persistent store for data versus using it as a platform for business intelligence (BI) and decision support workloads. As a platform for data ingestion, persistence, and preparation, the research suggests, Hadoop is orders of magnitude cheaper than a conventional OLTP or DW system. Conversely, the cost of using Hadoop as a primary platform for BI and analytic workloads is orders of magnitude more expensive.
An issue that tends to get glossed over is that of Hadoop's efficacy as a data management platform. Managing data isn't simply a question of ingesting and storing it; it's likewise, and to a much greater extent, a question of retrieving just the right data, of preparing it in just the right format, and of delivering it at more or less the right time. In other words, big data tools aren't only less productive than those of traditional BI and decision support; big data management platforms are themselves comparatively immature, too. Generally speaking, they lack support for key database features or for core transaction-processing concepts, such as ACID integrity. The simple reason for this is that many platforms either aren't databases or eschew conventional DBMS reliability and concurrency features to address scaling-specific or application-specific requirements. The upshot, then, is that the human focus of data integration is shifting and will continue to shift to Hadoop and other big data platforms, not least because these platforms tend to require considerable human oversight and intervention.
This doesn't mean that data, applications, and other resources are shifting or will shift to big data platforms, never to return or to be recirculated. For one thing, there's cloud, which is having no less profound an impact on data integration and data management. Data must be vectored from big data platforms (in the cloud or on-premises) to other big data
platforms (in the cloud or on-premises), to the cloud in general (i.e., to SaaS, platform-as-a-service [PaaS], and infrastructure-as-a-service [IaaS] resources) and, last but not least, to good old on-premises resources like applications and databases.

There's no shortage of data exchange formats for integrating data in this context (JSON and XML foremost among them), but the venerable SQL language will continue to be an important and even a preferred mechanism for data integration in on-premises, big data, and even
cloud environments. The reasons for this are many. First, SQL is an extremely efficient and productive language: According to a tally compiled by Andrew Binstock, editor-in-chief of Dr. Dobb's Journal, SQL trails only legacy languages such as .ASP and Visual Basic (at numbers 1 and 2, respectively) and Java (at number 3) productivity-wise. (Binstock based his tally on data sourced from the International Software Benchmarking Standards Group, or ISBSG, which maintains a database of more than 6,000 software projects.) Second, there's a surfeit of available SQL query interfaces and/or adapters, along with (to a lesser extent) SQL-savvy coders. Third, open source software (OSS) and proprietary vendors have expended a simply shocking amount of effort to develop ANSI-SQL-on-Hadoop technologies. This is a very good thing, chiefly because SQL is arguably the single most promising tool for getting the right data in the right format out of Hadoop.
Two years ago, for example, the most efficient ways to get data out of Hadoop included:
1. Writing MapReduce jobs in Java in order to translate the simple dependency, linear chain, or directed acyclic graph (DAG) operations involved in data engineering into map and reduce operations;
2. Writing jobs in Pig Latin for Hadoop's Pig framework to achieve basically the same thing;
3. Writing SQL-like queries in Hive Query Language (HiveQL) to achieve basically the same thing (see the sketch after this list); or
4. Exploiting bleeding-edge technologies (such as Cascading, an API layered on top of Hadoop that's supposed to make it easier to program and manage) to achieve basically the same thing.
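To make the HiveQL option concrete, here is a minimal sketch in Python using the PyHive client. It assumes a HiveServer2 instance listening on localhost:10000 and a hypothetical page_views table; it is meant only to illustrate how a simple aggregation that would once have required a hand-written MapReduce job can be expressed as a single query.

    from pyhive import hive  # third-party Python client for HiveServer2

    # Connect to a (hypothetical) HiveServer2 endpoint.
    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()

    # One HiveQL statement replaces a hand-coded map/reduce pair:
    # group page views by URL and count them.
    cursor.execute(
        "SELECT url, COUNT(*) AS views "
        "FROM page_views "
        "GROUP BY url "
        "ORDER BY views DESC "
        "LIMIT 10"
    )

    for url, views in cursor.fetchall():
        print(url, views)

    cursor.close()
    conn.close()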
Today, there's no shortage of mechanisms to get data from Hadoop. Take Hive, an interpreter that compiles HiveQL queries into MapReduce jobs. As of Hadoop 2.x, Hive can leverage either Hadoop's MapReduce engine or the new Apache Tez framework. Tez is just one of several designs that exploit Hadoop's new resource manager, YARN, which makes it easier to manage and allocate resources for multiple compute engines, in addition to MapReduce. Thus, Apache Tez, which is optimized for the operations (such as DAGs) that are characteristic of data transformation workloads, now offers features such as pipelining and interactivity for ETL-on-Hadoop. There's also Apache Spark, a cluster computing framework that can run in the context of Hadoop. It's touted as a high-performance complement and/or alternative to Hadoop's built-in MapReduce compute engine; as of version 1.0.0, Spark is paired with Spark SQL, a new, comparatively immature SQL interpreter. (Spark SQL replaces a predecessor project, dubbed Shark, which was conceived as a Hive-oriented SQL interpreter.) Over the last year, especially, Spark has become one of the most hyped of Hadoop-oriented technologies; many DI or analytic vendors now support Spark to one degree or another in their products. Generally speaking, most vendors now offer SQL-on-Hadoop options of one kind or another, while others also offer native (optimized) ETL-on-Hadoop offerings.
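As a rough illustration of what ETL-on-Hadoop with Spark SQL can look like, the following Python (PySpark) sketch uses the SQLContext entry point from the Spark 1.x line discussed above. The HDFS paths and the events dataset are hypothetical, and exact method names vary across Spark releases.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="etl-on-hadoop-sketch")
    sqlContext = SQLContext(sc)

    # Load semi-structured event data from HDFS; Spark SQL infers a schema.
    events = sqlContext.jsonFile("hdfs:///raw/events/2014/11/")
    events.registerTempTable("events")

    # A SQL transformation in place of a hand-written MapReduce job:
    # keep only purchase events and aggregate revenue per customer.
    purchases = sqlContext.sql("""
        SELECT customer_id, SUM(amount) AS revenue
        FROM events
        WHERE event_type = 'purchase'
        GROUP BY customer_id
    """)

    # Persist the prepared result back to HDFS in a columnar format.
    purchases.saveAsParquetFile("hdfs:///prepared/revenue_by_customer/")

    sc.stop()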
What's Ahead

Cloud is a critical context for data integration. One reason for this is that most providers offer export facilities or publish APIs that facilitate access to cloud data. Another reason, as I wrote last year, is that doing DI in the cloud doesn't invalidate (completely or, even, in large part) existing best practices: If you want to run advanced analytics on SaaS data, you've either got to load it into an existing, on-premises repository or, alternatively, expose it to a cloud analytics provider. What you do in the former scenario winds up looking a lot like what you do with traditional DI. And the good news is that you can do a lot more with traditional DI tools or platforms than used to be the case. Most data integration offerings can parse, shred, and transform the JSON and XML used for data exchange; some can do the same with formats such as RDF, YAML, or Atom. Several prominent database providers offer support for in-database JSON (e.g., parsing and shredding JSON documents via a name-value-pair function or landing and storing them intact as variable character text), while others offer some kind of support for in-database storage (and querying) of JSON data. DV vendors are typically no less accomplished than the big DI platforms with respect to their capacity to accommodate a wide variety of data exchange formats, from JSON/XML to flat files.
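The "shredding" described above, turning a nested JSON document into flat name-value pairs that can be loaded into relational or columnar targets, can be approximated in a few lines of Python. This is a simplified, generic sketch rather than any particular vendor's in-database function.

    import json

    def shred(document, prefix=""):
        """Flatten nested JSON into dotted name-value pairs."""
        pairs = {}
        if isinstance(document, dict):
            for key, value in document.items():
                pairs.update(shred(value, prefix + key + "."))
        elif isinstance(document, list):
            for index, value in enumerate(document):
                pairs.update(shred(value, prefix + str(index) + "."))
        else:
            pairs[prefix.rstrip(".")] = document
        return pairs

    raw = '{"order": {"id": 42, "items": [{"sku": "A1", "qty": 2}]}}'
    print(shred(json.loads(raw)))
    # {'order.id': 42, 'order.items.0.sku': 'A1', 'order.items.0.qty': 2}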
Any account of data integration and big data is bound to be insufficient simply because there is so much happening. As noted, the Hadoop platform is by no means the only (nor, for that matter, the most exciting) game in town. Apache Spark, which (a) runs in the context of Hadoop and which (b) can both persist data (to HDFS, the Hadoop Distributed File System) and run in-memory (using Tachyon), last year emerged as a bona fide big data superstar. Spark is touted as a compelling platform for both analytics and data integration. Several DI vendors already claim to support it to some extent. Spark, like almost everything else in the space, will bear watching. And so it goes.
Stephen Swoyer is a technology writer with more than 16 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost a decade. He's particularly intrigued by the thorny people and process problems most BI and DW vendors almost never want to acknowledge, let alone talk about. You can contact him at
Increasingly, firms are splitting up their
analytical teams into a model development
and a model validation team.
it is important to meticulously list all data within the enterprise that could potentially be beneficial to the analytical exercise. "The more data, the better" is the rule here. Analytical models have sophisticated built-in facilities to automatically decide which data elements are important for the task at hand and which ones can be left out of further analysis. The best way to improve the performance of any analytical model is by investing in data. This can be done by working on both quantity and quality simultaneously. Regarding the former, a key challenge concerns the aggregation of structured (e.g., stored in relational databases) and unstructured (e.g., textual) data to provide a comprehensive and holistic view of customer behavior. Closely related to this is the integration of offline and online data, an issue that many companies are struggling with nowadays. Furthermore, companies can also look beyond their internal boundaries and consider the purchase of external data from data poolers to complement their internal analytical models. Extensive research has indicated that this is very beneficial for both perfecting and benchmarking the analytical models developed.
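To illustrate the kind of built-in variable selection mentioned above, here is a small Python sketch using scikit-learn on synthetic data; the dataset and the cutoff of ten retained variables are arbitrary choices made for the example, not a recommendation from the article.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data: 1,000 customers, 25 candidate variables,
    # only 5 of which actually carry signal.
    X, y = make_classification(n_samples=1000, n_features=25,
                               n_informative=5, random_state=42)

    # Fit a model whose importance scores indicate which
    # data elements matter for the task at hand.
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X, y)

    # Rank variables by importance and keep the strongest ones.
    ranked = sorted(enumerate(model.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    keep = [index for index, score in ranked[:10]]
    print("Variables retained for modeling:", keep)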
Although data is typically available in large quantities, its quality is often a more painful concern. Here the GIGO principle applies: garbage in, garbage out, or bad data yields bad models. This may sound obvious at first. However, good data quality is often the Achilles' heel of many analytical projects. Data quality can be evaluated along various dimensions, such as data accuracy, data completeness, data timeliness, and data consistency, to name a few. To be successful in big data and analytics, companies need to continuously monitor and remedy data quality problems by setting up master data management programs and creating new job roles such as that of data auditor, data steward, or data quality manager.
Analytics should always start from a business problem rather than from a specific technological solution. However, this comes with a chicken-and-egg problem: To identify new business opportunities, one needs to be aware of the technological potential first. As an example, think about the area of social media analytics. By first understanding how this technology works, a firm can start thinking about how to leverage it to study its online brand perception or perform trend monitoring. To bridge the gap between technology and the business, continuous education is important. It allows companies to stay ahead of the competition and spearhead the development of new analytical applications. At this point, the academic world should make a mea culpa, since the offering of Master of Science programs in the area of big data and analytics is currently falling short of demand.
Another important component of turning data into concrete business insights and adding value using analytics concerns the proper validation of the analytical models built. Quotes such as "if you torture the data long enough, it will confess" and terms such as "data massage" have cast a negative light on the field of analytics. It speaks for itself that analytical models should be properly audited and validated, and many mechanisms, procedures, and tools are available to do this. That's why more and more firms are splitting up their analytical teams into a model development team and a model validation team. Good corporate governance then dictates the construction of a Chinese wall between the two teams, such that models developed by the former can be objectively and independently evaluated by the latter. One might even contemplate having the validation performed by an external partner. By setting up an analytical infrastructure whereby models are critically evaluated and validated on an ongoing basis, a firm is capable of continuously improving its analytical models and thus can better target its customers.
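One common building block of the independent validation described here is evaluating a model on data the development team never used for fitting. The following Python sketch, again using scikit-learn on synthetic data, shows that basic pattern; the split ratio and the choice of model are illustrative assumptions only.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # The development team fits on one partition...
    X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

    # ...while the validation team scores the model on data it never saw.
    scores = model.predict_proba(X_val)[:, 1]
    print("Out-of-sample AUC:", round(roc_auc_score(y_val, scores), 3))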
Analytics is not a one-shot exercise. In fact, the frustrating thing is that once an analytical model has been built and put into production, it is already outdated. Analytical models constantly lag behind reality, but the gap should be kept as small as possible. Just think about it: An analytical model is built using a sample of data gathered at a specific snapshot in time, given a specific internal and external environment. However, these environments are not static but change continuously because of both internal effects (new strategies, changing customer behavior) and external effects (new economic conditions, new regulations). Think about a fraud detection model whereby
crimina