BIG DATA SOURCEBOOK
Second Edition
WWW.DBTA.COM
From the publishers of Database Trends and Applications
CONTENTS
BIG DATA SOURCEBOOK | DECEMBER 2014

introduction
The Big Data Frontier
Joyce Wells

industry updates
How Businesses Are Driving Big Data Transformation
John O'Brien
The Enabling Force Behind Digital Enterprises
Joe McKendrick
Data Integration Evolves to Support a Bigger Analytic Vision
Stephen Swoyer
Turning Data Into Value Using Analytics
Bart Baesens
As Clouds Roll In, Expectations for Performance and Availability Billow
Michael Corey, Don Sullivan
Social Media Analytics Tools and Platforms: The Need for Speed
Peter J. Auditore
The Big Data Challenge to Data Quality
Elliot King
Building the Unstructured Big Data/Data Warehouse Interface
W. H. Inmon
Big Data Poses Security Risks
Geoff Keston
BIG DATA SOURCEBOOK is published annually by Information Today, Inc.,
143 Old Marlton Pike, Medford, NJ 08055
POSTMASTER
Send all address changes to: Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055
Copyright 2014, Information Today, Inc. All rights reserved.
PRINTED IN THE UNITED STATES OF AMERICA
The Big Data Sourcebook is a resource for IT managers and professionals providing information on the enterprise and technology issues surrounding the big data phenomenon and the need to better manage and extract value from large quantities of structured, unstructured, and semi-structured data. The Big Data Sourcebook provides in-depth articles on the expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as new capabilities for traditional data management systems. Articles cover business- and technology-related topics, including business intelligence and advanced analytics, data security and governance, data integration, data quality and master data management, social media analytics, and data warehousing.
No part of this magazine may be reproduced by any means, print, electronic, or any other, without written permission of the publisher.
COPYRIGHT INFORMATION
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC's Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS.
Creation of derivative works, such as informative abstracts, unless agreed to in writing by thecopyright owner, is forbidden.
Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook. Big Data Sourcebook disclaims responsibility for the statements, either of fact or opinion, advanced by the contributors and/or authors.
The views in this publication are those of the authors and do not necessarily reflect the views of Information Today, Inc. (ITI) or the editors.
© 2014 Information Today, Inc.
PUBLISHED BY Unisphere Media, a Division of Information Today, Inc.
EDITORIAL & SALES OFFICE 630 Central Avenue, Murray Hill, New Providence, NJ 07974
CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055
Thomas Hogan Jr., Group Publisher, 609-654-6266; thoganjr@infotoday
Joyce Wells, Managing Editor, 908-795-3704
Joseph McKendrick, Contributing Editor
Alexis Sopko, Advertising Coordinator, 908-795-3703
Adam Shepherd, Editorial and Advertising Assistant, 908-795-3705
Celeste Peterson-Sloss, Lauree Padgett, Alison A. Trotta, Editorial Services
Norma Neimeister, Production Manager
Denise M. Erickson, Senior Graphic Designer
Jackie Crawford, Ad Trafficking Coordinator
Sheila Willison, Marketing Manager, Events and Circulation, 859-278-2223
DawnEl Harris, Director of Web Events
ADVERTISING
Stephen Faig, Business Development Manager, 908-795-3702
INFORMATION TODAY, INC. EXECUTIVE MANAGEMENT
Thomas H. Hogan, President and CEO
Roger R. Bilboul, Chairman of the Board
John C. Yersak, Vice President and CAO
Thomas Hogan Jr., Vice President, Marketing and Business Development
Richard T. Kaser, Vice President, Content
Bill Spence, Vice President, Information Technology
The Big Data Frontier
By Joyce Wells
Today, big data, cloud, mobility, and the proliferation of connected devices, coupled with newer data
management approaches, such as Hadoop, NoSQL, and
in-memory systems, are increasing the opportunities for
enterprises to harness data. However, with this new fron-
tier there are challenges to be overcome. As they work to
maintain legacy applications and systems, IT organiza-
tions must address new demands for more timely access
to more data from more users, in addition to maintaining continuous availability of IT systems and enforcing
appropriate data governance.
It's a lot to think about. How can companies choose the
right approach to leverage big data while keeping newer
technologies in line with budgetary, application availabil-
ity, and security concerns?
Over the past year, Unisphere Research, a division of
Information Today, Inc., has conducted surveys among IT
professionals to gain insight into the challenges organiza-
tions are facing.
The information overload is already taking its toll on
IT organizations and professionals. According to a Uni-
sphere Research report, "Governance Moves Big Data
From Hype to Confidence," the percentage of organiza-
tions with big data projects is expected to triple by the
end of 2015. However, while organizations are investing
in increasing the information at their disposal, they are
finding that they are committing more time to simply
locating the necessary data, as opposed to actually ana-
lyzing it. In addition, the report, based on a survey of 304
data management professionals and sponsored by IBM,
found that respondents tend to be less confident about
data gathered through social media and public cloud
applications.
With all this data, there are also concerns about the ability to maintain the high availability mandated by
today's stringent service-level agreements. According to
another Unisphere Research survey sponsored by EMC,
and conducted among 315 members of the Indepen-
dent Oracle Users Group (IOUG), close to one-fourth
of respondents' organizations have SLAs of four nines of
availability or greater, meaning that they can have only 52
minutes or less of downtime a year. The survey, "Bringing
Continuous Availability to Oracle Environments," found
that more than 25% of respondents dealt with more than
8 hours of unplanned downtime during the previous
year, which they attributed to network outages, server
failures, storage failures, human error, and power outages.
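For readers who want to check the arithmetic behind those figures, here is a quick Python sketch; the availability levels are the standard "nines" shorthand, not numbers taken from the survey.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget(nines):
    """Allowed downtime per year, in minutes, for an SLA with the given number of nines."""
    availability = 1 - 10 ** -nines   # e.g., 4 nines -> 0.9999
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines: about {downtime_budget(n):.1f} minutes of downtime per year")
# Four nines works out to roughly 52.6 minutes, the ceiling cited above.
```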
As data management and access becomes more critical
to business success, Unisphere Research finds that IT pro-
fessionals are embracing their expanded roles and relish
the opportunity to work with new technologies. Increas-
ingly, they want to be at the center of the action, and are
assuming roles associated with data science, but too often
they see themselves being forced into the job of firefighting rather than strategic, high-value tasks. The benefits of
ongoing staff training and use of cloud and database auto-
mation are some of the approaches cited in the report,
"The Vanishing Database Administrator," sponsored by
Ntirety, a division of HOSTING.
Indeed, the increasing size and complexity of data-
base environments is stretching IT resources thin, caus-
ing organizations to seek ways to automate routine tasks
to free up assets, such as by tapping into virtualization and
cloud. According to "The Empowered Database," a report
based on a survey of 338 IOUG members, and sponsored
by VMware and EMC, nearly one-third of organizations
are using or considering a public cloud service, and almost
half are currently using or considering a private cloud.
Still, we are just at the beginning of the changes to
come as a result of big data. In a recent Unisphere Research
Quick Poll, close to one-third of enterprises, or 30%,
report they have deployed the Apache Hadoop framework
in some capacity while another 26% said they planned
to adopt Hadoop within the next year. Strikingly, 91% of
respondents at Hadoop sites will be increasing their use
of Hadoop over the next 3 years, and one-third describe
expansion plans as significant. Key functions or applica-
tions supported by Hadoop projects include analytics and
business intelligence, working with IT operational data,and supporting special projects.
To help shed light on the expanding territory of big
data, DBTA presents the second annual Big Data Source-
book, a guide to the key enterprise and technology matters
IT professionals are grappling with as they take the jour-
ney to becoming data-driven enterprises. In addition to
articles penned by subject matter experts, leading vendors
also showcase their products and approaches to gaining
value from big data projects. Together, this combination
of articles and sponsored content provides insight into the
current big data issues and opportunities.
sponsored content

Operational Big Data
In fact, the operational database is a source of big data. Today, operational
databases must meet the challenges of
variety, velocity, and volume with millions
of users and billions of machines reading
and writing data via enterprise, mobile, and
web applications. The data is stored in an
operational database before it's stored in an Apache Hadoop distribution.
It's audits, clickstreams, customer
information, financial investments and
payments, inventory and parts, locations,
logs, messages, patient records, plays and
scores, sensor readings, scientific data, social
interactions, user and process status, user
and visitor profiles, and more.
It drives the eCommerce, energy,
entertainment, finance, gaming,
healthcare, insurance, retail, social media,
telecommunications industries, and more.
Today, operational databases must read
and write billions of values, maintain low
latency, and sustain high throughput to
meet the challenges of velocity and volume.
They must sustain millions of operations
per second, maintain sub-millisecond
latency, and store billions of documents
and terabytes of data. They must be able to
support the evolution of data in the form of
new attributes and new types.
The ability to meet these challenges is
necessary to support an agile enterprise.
By doing so, the agile enterprise extracts
actionable intelligence. However, time is
of the essence. When a new type of data
emerges, operational databases must store
it without delay. When the number of users
and machines increases, the operational
database must continue to provide data access
without performance degradation. When the
size of the data set increases, the operational
database must continue to store data.
These challenges are met by a)
supporting a flexible data model and b)
scaling out on commodity hardware. They
are met by NoSQL databases. They are met
by Couchbase Server. It's a scalable, high-performance document database engineered
for reliability and availability. By supporting a
document model via JSON, it can store new
attributes and new types of data without
modification, index the data, and enable
near-real time, lightweight analytics. By
implementing a shared-nothing architecture
with no single point of failure and consistent
hashing, it can scale with ease, on-demand,
and without affecting applications. By
integrating a managed object cache and
asynchronous persistence, it can maintain
sub-millisecond response times and sustain
high throughput. Couchbase Server was
engineered for operational big data and its
requirements.
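As a rough illustration of the flexible document model described here, the sketch below stores two JSON user profiles, the second of which carries attributes the first lacks, with no schema change. The connection string, bucket name, and document keys are placeholders, and the calls assume the 2.x Couchbase Python SDK's Bucket interface; this is a sketch, not an excerpt from Couchbase's documentation.

```python
from couchbase.bucket import Bucket  # assumes the 2.x Python SDK is installed

# Two documents for the same logical type; the second simply adds attributes.
# A JSON document store accepts both without any migration.
profile_v1 = {"type": "user_profile", "name": "Pat", "email": "pat@example.com"}
profile_v2 = {"type": "user_profile", "name": "Lee", "email": "lee@example.com",
              "devices": ["phone", "tablet"],         # new attribute
              "last_login": "2014-11-30T08:15:00Z"}   # new attribute

bucket = Bucket("couchbase://localhost/default")      # placeholder connection
bucket.upsert("user::pat", profile_v1)
bucket.upsert("user::lee", profile_v2)

print(bucket.get("user::lee").value["devices"])       # ['phone', 'tablet']
```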
While operational databases provide real-
time data access and lightweight analytics,
they must integrate with Apache Hadoop
distributions for predictive analytics,
machine learning, and more. While
operational data feeds big data analytics,
big data analytics feed operational data. The
result is continuous refinement. By analyzing the operational data, it can be updated to
improve operational efficiency. The result is
a big data feedback loop.
Couchbase provides and supports
a Couchbase Server plugin for Apache
Sqoop to stream data to and from Apache
Hadoop distributions. In fact, Cloudera
certified it for Cloudera Enterprise 5. In
addition, Couchbase provides and supports
a Couchbase Server plugin for Elasticsearch
to enable full text search over operational
big data.
Finally, operational databases must
meet the requirements of a global economy
in the information age. Today, users and
machines read and write data to enterprise,
mobile, and web applications from multiple countries and regions. To maintain data
locality, operational databases must support
deployment to multiple data centers. To
maintain the highest level of data locality,
operational databases must extend to mobile
phones / tablets and connected devices.
Couchbase Server supports both
unidirectional and bidirectional cross
data center replication. It enables the agile
enterprise to deploy an operational database
to multiple data centers in multiple regions
and in multiple countries. It moves the
operational database closer to users and
machines. In addition, Couchbase Server
can extend to mobile phones / tablets and
connected devices with Couchbase Mobile.
The platform includes Couchbase Lite, a
native document database for iOS, Android,
Java/Linux, and .NET, and Couchbase Sync
Gateway to synchronize data between
local databases and remote database servers.
The combination of cross data center
replication and mobile synchronization enables the agile enterprise to extend global
reach to individual users and machines. If
deployed to cloud infrastructure like Amazon
Web Services or Microsoft Azure, there is no
limit to how far Couchbase Server can scale
or how far the agile enterprise can reach.
COUCHBASE
www.couchbase.com
industry updates

The State of Big Data in 2014
How Businesses Are Driving Big Data Transformation
By John O'Brien
In 2014, we continued to watch how big data
is enabling all things big about data and its
business analytics capabilities. We also saw the emergence (and early acceptance) of Hadoop
Version 2 as a data operating platform, with
cornerstones of YARN (Yet Another Resource
Negotiator) and HDFS (Hadoop Distributed
File System). The ecosystem of Apache Foun-
dation projects has continued to mature at a
rapid pace, while vendor products continue
to join, mature, and benefit from Hadoop
improvements.
In last year's Big Data Sourcebook we
highlighted several items in "The State of
Big Data" article worth recapping. First, we
referenced the battle over persistence for
data architectures, primarily in enterprise adoption, that dealt with the promise of
the "everything in Hadoop" pundits and the "it's
OK to have another data platform" camp. In 2014,
we witnessed the acceptance of these multi-
tiered, specific workload capability architec-
tures that, at Radiant Advisors, we refer to
as the modern data platform. With gaining
acceptance, Hadoop is here to stay, and many
analysts refer to its role as "inevitable." This,
naturally, is tempered with its maturity, the
ability for enterprises to find and/or train
resources, and specifying the proper first use
case project and long term strategy, such as
the data lake or enterprise data hub strategies. We also discussed how companies needed
to understand how "data is data" when
approaching big data with "big eyes." For
the most part, in 2014 we saw mainstream
companies shift from a "the sky is falling if I
don't start a big data project" mindset to dis-
tinguishing big data projects as those for sit-
uations where the data wasn't typically rela-
tionally structured, or when it had volatile
schemas. "Schema on read" versus "schema
on write" benefits and situations became a
much better understood term in 2014, too.
And, more importantly, we have seen an
increasing understanding that all data can
be valuable and the need to explore data
for discovery and insights. Last year, we said that 2014 would be
the "race for access" hill as companies
demanded better access to data in Hadoop
by business analysts and power users and
that this access no longer be restricted to
programmers. As SQL reasserted itself as
the de-facto standard for common knowl-
edge users and existing data analysis and
integration tools, the SQL access capa-
bilities of Hadoop were under incredible
pressure to improve both in performance
and capability. Continued releases by Hor-
tonworks with Hive/Tez, Cloudera Impala,
and MapR's Drill initiative made orders
of magnitude performance improvements
for SQL access. The race was on: Actian's
Vortex made a splash at the Hadoop Sum-
mit in June, and others, such as IBM and
Pivotal, made significant improvements,
too. The race in 2014 continues going into
2015 with more SQL analytic capabilities
and performance improvements.
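Schema on read, mentioned above, simply means structure is imposed when the data is queried rather than when it is loaded; it is what these SQL-on-Hadoop engines rely on. A minimal Python sketch of the idea, with made-up event data and field names:

```python
import json

# Raw events land as-is, one JSON object per line; no table was defined up
# front (schema on write), so later events can carry attributes earlier ones lack.
raw_events = [
    '{"ts": "2014-10-01T12:00:00Z", "user": "a17", "action": "view"}',
    '{"ts": "2014-10-01T12:00:05Z", "user": "b42", "action": "buy", "sku": "X-9"}',
]

def read_with_schema(lines, fields):
    """Apply the desired schema at read time; attributes not present become None."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

for row in read_with_schema(raw_events, ["ts", "user", "action", "sku"]):
    print(row)
```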
Hadoop 2 Ushers in the Next Generation
The significance of Hadoop 2 has
recently started to resonate with com-
panies and enterprise architects. Mov-
ing away from its batch-oriented origins,
YARN has clearly positioned the data
operating system as two separate funda-
mental architecture components.
While the HDFS will continue to evolve
as the caretaker of data in the distributed
file system architecture with improved
name node high availability and perfor-
mance, YARN, introduced in Hadoop 2,
completely changes the paradigm of data
engines and access. Though the primary
role of YARN is still that of a resource nego-
tiator for the Hadoop cluster and focused on managing the resource needs of tens of
thousands of jobs in the cluster, it has also
now established a new framework.
The YARN framework serves as a plug-
gable layer of YARN-certified engines
designed to work the data in different
ways. Previously, MapReduce was the pri-
mary programming framework for devel-
opers to create applications that leveraged
the parallelism of the data nodes. As other
projects and data engines could work with
HDFS directly without MapReduce, a
centralized resource manager was needed
that would also enable innovation for new
data engines. MapReduce became its own
YARN engine for existing Hadoop 1 legacy
code, and Hive decoupled to work with
the new Tez engine. Long recognized as
ahead of the curve, Google caused quite a
furor when it announced that MapReduce
was dead and that they would no longer
develop in it. YARN was positioned for the
future of next-generation engines.
Sometimes in 2014 we felt that the
booming big data drum was starting to die down. And, sometimes we wondered
if it only seemed that way because every-
one was chanting "Storm" just a bit louder.
Another major driver in the Hadoop
implementations was that big data didn't
mean "fast data." The industry wanted
both big and fast: The Spark environment
is where both early adopters were writing
new applications, and the development
community was quickly developing Spark
to be a top-level project to meet those
needs. The Spark community touts itself
as "lightning-fast cluster computing," pri-
marily leveraging in-memory capabilities
of the data nodes, but also a newer, faster
framework than MapReduce on disk. While Spark was in its infancy in 2013,
we saw this need for big data speed being
tackled by two-tier distributed in-memory
architectures. Today, Spark is a framework
for Spark SQL, Spark Streaming, Machine
Learning, and GraphX running on
Hadoop 2s YARN architecture. In 2014,
this has been very exciting for the industry,
but many of the mainstream adopters are
patiently waiting for the early adopters to
do their magic.
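As a sketch of why that in-memory model matters for exploratory work, the RDD-style example below caches a filtered data set so that repeated queries avoid rereading from disk. The file path and log format are hypothetical; the calls use the Spark 1.x Python API.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "log-exploration")

# Load raw logs (path is a placeholder) and keep only error lines in memory.
lines = sc.textFile("hdfs:///data/clickstream/2014/part-*.log")
errors = lines.filter(lambda line: "ERROR" in line).cache()

print(errors.count())                      # first action materializes the cache
codes = errors.map(lambda line: (line.split()[0], 1)) \
              .reduceByKey(lambda a, b: a + b)
print(codes.take(10))                      # subsequent work is served from memory

sc.stop()
```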
Two Camps: Early Adopters and Mainstream Adopters
For years, overwhelming data volumes,
complexity, or data science endeavors were
the primary drivers behind early big data
adopters. Many of these early adopters
were in internet-related industries, such
as search, e-commerce, social networking,
or mobile applications that were dealing
with the explosion of internet usage and
adoption.
In 2014, we saw mainstream adopters
become the next wave of big data implementations that are expected to be multi-
ple times larger than the early adopters. We
define mainstream adopters as those busi-
nesses that seek to modernize their data
platforms and analytics capabilities for
competitive opportunities and to remain
relevant in a fast changing world, but are
tempered with some time to research, ana-
lyze, and adopt while maintaining current
business operations. Mainstream adopt-
ers have had pilots and proof of concepts
for the past year or two with one or two
Hadoop distributors and now are decid-
ing how this also fits within their overall
enterprise data strategy. Leading the way for mainstream adopt-
ers is, by consequence, meeting enterprise
and IT requirements for data management,
security, data governance, and compliance
in a new, more complicated, set of data
that includes public social data, private
customer data, third-party data enrich-
ment, and storage in cloud and on-prem-
ises. Over the past year, it has often felt like
the fast-driving big data vehicle hit some
pretty thick mud to plow through, and
some in the industry argued that forcing Hadoop to meet the requirements of
enterprise data management was missing
the point of big data and data science. For
now, we have seen most companies agree
that risk and compliance are things that
they must take seriously moving forward.
Mainstream Adopters Redefining Commodity Hardware
As mainstream adopters worked
through data management and governance
hurdles for enterprise IT, next up was the
startling exclamation: "I thought you said
that was cheap commodity hardware?!"
This has become an interesting reminder
of the roots of big data and the difference
with IT enterprise-class hardware.
The explanation goes like this. Early
developers and adopters were driven to
solve truly big data challenges. In the sim-
plest of terms, big data meant big hardware
costs and, in order to solve that economic
challenge, big data needed to run on the
lowest cost commodity hardware and
software that was designed to be fault-tolerant to cope with high failure rates with-
out disrupting service. This is the purpose
of HDFS, though HDFS does not differen-
tiate how a data node is configured and
this is where IT's standard order list differs.
Enterprise infrastructure organiza-
tions have been maintaining the data cen-
ter needs of companies for years and have
efficiently standardized orders with chosen
vendors. In this definition of commodity
servers, it's more about industry standards
in parts, and no proprietary hardware
could limit the use of these servers as data
nodes (or any other server needs in the data
center). While big data implementations with hundreds to thousands of servers per
cluster strive for the lowest cost white box
servers from less recognized industry ven-
dors with the lowest cost components, their
commodity servers can be as low as $2,000
per server. Similar servers from industry
recognized big names with their own com-
ponents or industry best of breed com-
ponents touting stringent integration and
quality testing have averaged $25,000 per
server in several recent Hadoop implemen-
tations that we have been involved with. We
have started to coin these servers as "com-
modity-plus" for mainstream companies
operationalizing Hadoop clusters, and
they don't seem to mind.
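A back-of-the-envelope comparison using the per-server prices cited above makes the trade-off concrete; the node counts below are illustrative examples, not figures from the article.

```python
# Per-server prices come from the article; cluster sizes are hypothetical.
clusters = {
    "white box (100 nodes)":     {"nodes": 100, "price_per_node": 2_000},
    "commodity-plus (16 nodes)": {"nodes": 16,  "price_per_node": 25_000},
}

for label, cfg in clusters.items():
    total = cfg["nodes"] * cfg["price_per_node"]
    print(f"{label}: ${total:,} total hardware")
# white box (100 nodes): $200,000 total hardware
# commodity-plus (16 nodes): $400,000 total hardware
```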
Another discussion that continues
from the early adopters is how a data
node should be configured. Some imple-
mentations concerned with truly big data
configure data nodes with 25 front-load-
ing bays and multi-terabyte slower SATA
drives for the highest capacity within
their cluster. Other implementations are
more concerned with performance and
opt for faster SAS drives at lower capaci-
ties but balanced with more servers in the
cluster for further increased performance from parallelism. Some hyper-perfor-
mance-oriented clusters will even opt for
faster SSD drives in the cluster. This also
leads to discussions regarding multi-core
CPUs and how much memory should
be in a data node. And, there have been
equations for the number of cores related
to the amount of memory and number of
drives for optimal performance of a data
node. We have seen that enterprise infra-
structure has leaned more toward fewer
nodes in a production cluster (8 to 32 data
nodes) rather than 100-plus nodes. Their
reasoning is twofold: More powerful data
nodes are actually more interchangeable, with data centers also converging data
virtualization and private cloud strate-
gies. Second, ordering more of the pow-
erful servers can yield increased volume
discounts and maintain standardization
of IT servers in the data center.
The Data Lake Gains Traction
In 2014, we saw more acceptance of
the term "data lake" as an enterprise data
architecture concept pushed by Horton-
works and its "modern data architecture" approach. The "enterprise data hub" is a
similar concept promoted by Cloudera
and also has some of the industry mind-
share. Informally, we saw the data lake term
used most often by companies seeking to
understand an approach to enterprise data
strategy and roadmaps. However, we also
saw backlash from industry pundits that
called the data lake a "fallacy" or "murky."
Terms such as "data swamp" and "data
dump" were also thrown around to describe how
things could go wrong without a good
strategy and governance in place. Like the
term big data, the data lake has started
out as a high-level concept to drive further
definition and patterns going forward.
Throughout 2014, we worked with
companies ready to define a clear, detailed
strategy based on the data lake concept for
enterprise data strategy. While this is pro-
found, it is very achievable with data man-
agement principles that require answers to
new questions regarding a new approach
to data architecture. Some issues are sim-
ple and more technical, such as keeping online archives of historical data ware-
house data still easily accessible by users
with revised service-level agreements.
Some issues are more fundamental, such as
the data lake serving as a single repository of
all data, including being a staging area for
the enterprise data warehouse (with lower-
cost historical persistence for other uses, as
data scientists are more interested in raw,
unaltered data). Other concerns are a bit
more complex, such as persisting customer
or other privacy-compliant data in the data
lake for analysis purposes. Data governance
is concerned with who has access to priva-
cy-controlled data and how it is used. Data
management questioned the duplication
of enterprise data and consistency.
These are hard data management and
governance decisions for enterprises to
make, but they are making them, and
acknowledging that patience and adapt-
ability are key for the coming years as
data technologies continue to evolve and
change the landscape. The data lake will
continue to prove itself and make a fun-
damental shift in enterprise architecture
in the coming years. When you take a step back and watch the business and IT driv-
ers, momentum, and technology develop-
ment, you can see how the data lake will
become an epicenter in enterprise data
architecture. If you take two steps back,
you will see how 2015 developments could
begin the evolution that transforms the
data lake into a data operating system for
the enterprise, evolving beyond business
intelligence and analytics into operational
applications and further realization of ser-
vice-oriented architectures.
What's Ahead
In 2015, the mainstream adoption of
enterprise data strategies and acceptance
of the data lake will continue as data man-
agement and governance practices provide
further clarity. The cautionary tale of 2014,
to ensure business outcomes drive big data
adoption rather than the hype of previ-
ous years, will likewise continue. Hadoop
is clearly here to stay and "inevitable,"
and will have its well-deserved seat at the
enterprise data table, along with other data technologies. Hadoop won't be
taking over the world any time soon, and
principle-based frameworks (such as our
own modern data platform) recognize the
evolution of both data technologies and
computing price/performance on mod-
ern data architecture. Besides the usual
maturing and improvements overall and
for existing big data tools, we predict some
major achievements in big data for 2015
that we're keeping an eye on.
The Apache Spark engine will con-
tinue to mature, improve, and gain accep-
tance in 2015. With this adoption and the
incredible capabilities that it delivers, we could start to see applications and capabil-
ities beyond our imagination. Keep an eye
out for these early case studies as inspiration
for your own needs.
With deepening acceptance and recog-
nition of YARN as the standard for operat-
ing Hadoop clusters, open-source projects
and existing vendors will port their prod-
ucts to YARN certification and integration.
This will not only close the gap between
existing data technologies and
Hadoop clusters, but more exciting will be to see data technologies port over to YARN
so that they can operate and improve their
own capabilities within Hadoop. New
engines and existing engines running on
YARN in 2015 will further influence and
drive the adoption of Hadoop in enter-
prise data architecture.
In 2014, we saw mainstream compa-
nies requiring data management features
such as security and access control. These
first steps will be critical to keep an eye on
during 2015 for your own company's data
management requirements. Our concern
here is that the sexy high-performance world of Spark and improved SQL capa-
bilities will get the majority of attention,
while the less sexy side of security and gov-
ernance will not mature at the same rate.
There is significant pressure to do so with
the mountain of mainstream adopters
waiting, so we'll keep an eye on this one.
Finally, our most exciting item to watch
in 2015 will be Hadoop's subtle transfor-
mation as business drivers move it beyond
a primary write-once/read-many repu-
tation to that of full create/read/update/
delete (CRUD) operational capability at
big data scale. The benefits of the Hadoop
architecture with YARN and HDFS go well beyond big data analytics, and enter-
prise data architects can start thinking
about what a YARN data operating system
can do with operational systems. In a few
years, this could also redefine the data lake
or we'll simply create another label for
the industry to debate. Once big data, high
performance, and CRUD requirements are
met within Hadoop, enterprise architects
will start thinking about the economies
of scale and efficiency gained from this
next-generation architecture.
John O'Brien is princi-
pal and CEO of Radiant
Advisors. With more than
25 years of experience
delivering value through
data warehousing and
business intelligence pro-
grams, O'Brien's unique perspective
comes from the combination of his roles
as a practitioner, consultant, and vendor
CTO in the BI industry. As a globally rec-
ognized business intelligence thought
leader, O'Brien has been publishing arti-
cles and presenting at conferences in
North America and Europe for the past 10
years. His knowledge in designing, build-
ing, and growing enterprise BI systems
and teams brings real-world insights to
each role and phase within a BI pro-
gram. Today, through Radiant Advisors,
O'Brien provides research, strategic advi-
sory services, and mentoring that guide companies in meeting the demands of
next-generation information management,
architecture, and emerging technologies.
In Q1 2014, Radiant Advisors released its
"Independent Benchmark: SQL on Hadoop
Performance," which captured the current
state of options and widely varying perfor-
mance. Radiant Advisors plans to release
the next benchmark 1 year later in Q1 2015
to quantify those efforts.
sponsored content

Big Data for Tomorrow
The world of enterprise solutions
has changed. It has become distributed and
real-time work. A famous NY Times writer,
Thomas Friedman, summarizes it succinctly:
"The World Is Flat." In addition to this
technological advancement, the compute
and online world is demanding real-time
answers to questions. These ever-growing
and disparate data sources need to be efficiently connected to enable new discovery
and more insightful answers.
To maintain competitive advantage in
this new landscape, organizations must be
prepared to weed out the hype and focus
on proven ways to future-proof existing
systems while efficiently integrating with
new technologies to provide the required
value of real-time insight to users and
decision-makers. Companies need to focus
on the following key requirements for new
technologies to take advantage of data and
find unique business value and new revenues.
DISTRIBUTED
The world is moving towards distributed
architectures. Memory is becoming a
commodity; the Internet is easily accessible
and fairly inexpensive, and with more sources
of data creating an increase in information, it
is easy to understand how organizations will
require multiple, distributed data centers to
store it all.
With distributed architectures comes a
need for distributed features such as parallel ingest, or the ability to quickly obtain data
using multiple resources/locations to enable
real-time application access to information
that is being processed. Then there is a
need for distributed task processing, which
helps to move the processes closer to the
locations where data is stored, thus saving
time and improving query performance as a
side effect. Finally, there becomes a need for
distributed query as well. This is the ability
to perform a search of data across different
locations, quickly in order to find hidden
value within the data for improved business
decision support.
SCALABLE
The next requirement revolves around
ease of scalability. When working with
distributed architecture, it is inevitable that
companies will need to eventually scale out their applications across multiple locations
in order to keep up with growing data
demands. Technology that is easily scalable/
adaptable is very important in long-term
success and helps with managing ROI.
FLEXIBLE
Another requirement, due to the many
different types of data being collected, is the
ability to handle multiple data types. If a
technology is too limited in the way it needs
to collect information from structured,
unstructured, semi-structured sources,
organizations will find it difficult to grow
their solution long-term due to concerns
with data type limitations. On the other
hand, a technology that is able to natively or
alternatively store and access many types of
information from multiple data sources will
be key to enabling long-term competitive
advantage and growth.
COMPLEMENTARY
And finally, there is a need to address
existing and legacy solutions already implemented at a large scale. Most
enterprises will not be tearing out widely
implemented solutions spanning across
their organization. It is important to require
that any new technologies being assessed
have the ability to complement existing
legacy solutions as well as any potential new
technologies that may add benefit to the
business, its customers and solution/services.
Today's enterprise success depends on the
ability to obtain key information quickly and
accurately and then apply that knowledge
to your business to make more reliable
decisions. Utilizing technology that is able
to offer the peace of mind to be successful
through distributed, scalable, flexible and
complementary features is priceless.
For over a quarter century, Objectivity,
Inc.'s embedded database software has
helped discover and unlock the hidden value in Big Data for improved real-
time intelligence and decision support.
Objectivity focuses on storing, managing
and searching the connection details
between data. Its leading edge technologies,
InfiniteGraph, a unique distributed, scalable
graph database, and Objectivity/DB, a
distributed and scalable object management
database, enable unique search and
navigation capabilities across distributed
datasets to uncover hidden, valuable
relationships within new and existing data
for enhanced analytics and facilitate custom
distributed data management solutions for
some of the most complex and mission-
critical systems in operation around the
world today.
By working with a well-established
technology provider with long-term, proven
Big Data implementations, enterprise
companies can feel confident that the future
requirements of their organizations will be
met along with the ability to take advantage
of new technological advances to keep ahead
of the market. For more information on how to get
started with evaluating technologies for your
business, contact Objectivity, Inc. to inquire
about our complimentary 2-hour solution
review with a senior technical consultant.
Visit our website at www.objectivity.com for
more information.
OBJECTIVITY, INC.
www.objectivity.com
industry updates

The State of Big Data Management
The Enabling Force Behind Digital Enterprises
By Joe McKendrick
For years, data management was part of
a clear and well-defined mission in organiza-
tions. Data was generated from transaction
systems, then managed, stored, and secured
within relational database management sys-
tems, with reports built and delivered to busi-
ness decision makers' specs.
This rock-solid foundation of skills,
technologies, and priorities served enter-
prises well over the years. But lately, this
arrangement has been changing dramati-
cally. Driven by insatiable demand for IT
services and data insights, as well as theproliferation of new data sources and for-
mats, many organizations are embracing
new technology and methods such as cloud,
database as a service (DBaaS), and big data.
And, increasingly, mobile isn't part of a ven-
dor's pitch sheet, or futuristic overview at a
conference presentation. It's part of today's
reality, a part of everyday business. Many
organizations are already providing faster
delivery of applications, differentiated prod-
ucts and services, and some are building
new customer experiences through social,
mobile, analytics, and cloud.
Over the coming year, 2015, we will
likely see the acceleration of the following
dramatic shifts in data management:
1. More Automation to Manage
the Squeeze
There is a lot of demand coming from the
user side, but data management profession-
als often find themselves in a squeeze. Busi-
ness demand for database services as well as
associated data volumes is growing at a rate of 20% a year on average, a survey by Uni-
sphere Research finds. In contrast, most IT
organizations are experiencing flat or shrink-
ing budgets. Other factors such as substantial
testing requirements and outdated manage-
ment techniques are all contributing to a cost
escalation and slow IT response.
Database professionals report that they
spend more time managing database lifecy-
cles than anything else. A majority still over-
whelmingly perform a range of tasks manu-
ally, from patching databases to performing
upgrades. Compliance remains important
and requires attention. As databases move
into virtualized and cloud environments,
there will be a need for more comprehen-
sive enterprise-wide testing. Another recent
Unisphere Research study finds that for more
than 50% of organizations, it takes their IT
department 30 days or more to respond to
new initiatives or deploy new solutions. For
a quarter of organizations, it takes 90 days
or more. In addition, more than two-thirds
of organizations indicate that the number of databases they manage is expanding. The
most pressing challenges they are facing as
a result of this expansion are licensing costs,
additional hardware and network costs, addi-
tional administration costs, and complexity.
("The Empowered Database: 2014 Enterprise
Platform Decisions Survey," September 2014).
As data professionals find their time
and resources squeezed between managing
increasingly large and diverse data stores,
increased user demands, and restrictive
budgets, there will be greater efforts
to automate data management tasks.
Expect a big push to automation in the
year ahead.
2. Big Data Becomes Part of Normal Day-to-Day Business
Relational data coming out of transac-
tional systems is now only part of the enterprise equation, and will share the stage to
a greater degree with data that previously
could not be cost-effectively captured, man-
aged, analyzed, and stored. This includes
data coming in from sensors, applications,
social media, and mobile devices.
With increased implementations
of tools and platforms to manage this
dataincluding NoSQL databases and
Hadooporganizations will be better
equipped to prepare this data for con-
sumption by analytic software. A recent
survey of Database Trends and Applications
readers finds 26% now running Hadoop
within their enterprises, up from 12% 3
years ago. A majority, 63%, also now oper-
ate NoSQL databases at their locations
("DBTA Quick Poll: New Database Tech-
nologies," April 2014).
3. Cloud Opens Up Database as a Service
More and more, data managers and
professionals will be working with cloud-
based solutions and data, whether associated with a public cloud service, or an
in-house database-as-a-service (DBaaS)
solution. This presents many new oppor-
tunities to provide new capabilities to
organizations, as well as new challenges.
Moving to cloud means new program-
ming and data modeling approaches will
be needed. Integration between on-prem-
ises and off-premises data also will be
intensifying. Data security will be a front-
burner issue.
Recent Unisphere Research surveys
find that close to two-fifths of enterprises
either already have or are considering run-
ning database functions within a private
cloud, and about one-third are currently
using or considering a public cloud ser-
vice. For more than 25% of organizations,
usage of private-cloud services increased
over the past year. Cloud and virtualization are being
seamlessly absorbed into the jobs of most
database administrators, and in some
cases, reducing traditional activity while
expanding their roles. Database as a ser-
vice (DBaaS), or running databases and
managing data within an enterprise pri-
vate cloud setting, offers data managers
and executives a means to employ shared
services to manage their fast-growing
environments. The potential advantage
of DBaaS is that database managers need
not re-create processes or environments
from scratch, as these resources can be
pre-packaged based on corporate or com-
pliance standards and made readily avail-
able within the enterprise cloud. Close to
half of enterprises say they would like
to see capacity planning services offered
through private clouds, while 40% look
for shared database resources. A sim-
ilar number would value cloud-based
services providing automated database
provisioning.
4. Virtualization and Software-Defined Data Centers on the Way
Until recently, mentioning the term
platform brought images of Windows,
mainframe, and Linux servers to mind.
However, for most enterprises, platform
has become irrelevant. This extends to
the database sphere as well; many of the
functions associated with specific data-
bases can be abstracted away from under-
lying hardware and software.
The use of virtualization is helping
to alleviate strains being created by the
increasing size and complexity of data-
base environments. The use of virtual-
ization within database environments is
increasing. Almost two-thirds of organi-
zations in a recent Unisphere Research
survey say there have been increases over
the past year. Nearly half report that more than 50% of their IT infrastructure is
virtualized. The most common benefits
organizations report as a result of using
virtualization within their database envi-
ronments are reduced costs, consolidation,
and standardization of their infrastructure
("The Empowered Database: 2014 Enter-
prise Platform Decisions Survey," Septem-
ber 2014).
Another emerging trend, software-
defined data centers, software-defined
storage, and software-defined network-
ing, promises to take this abstraction to a
new level. Within a software-defined envi-
ronment, services associated with data
centers and database services (storage,
data management, and provisioning) are
abstracted into a virtual service layer. This
means managing, configuring, and scaling
data environments to meet new needs will
increasingly be accomplished from a sin-
gle control panel. It may take some time
to reach this stage, as many of the compo-
nents of software-defined environments
are just starting to fall into place. Expect to see significant movement in this direc-
tion in 2015.
5. Data Managers and Professionals Will Lead the Drive to Secure Corporate Data
One need only look at recent headlines
to understand the importance of data
security: major enterprises have suf-
fered data breaches over the past year, and
in some cases, have taken CIOs and top
executives down with them. The rise of
big data and cloud, with their more com-
plex integration requirements, accessibil-
ity, and device variety, has increased the need for greater attention to data security
and data governance issues.
Data security has evolved into a
top business challenge, as villains take
advantage of lax preventive and detective
measures. In many ways, it has become an
enterprise-wide issue in search of leader-
ship. Senior executives are only too pain-
fully aware of what's at stake for their
businesses, but often don't know how to
approach the challenge. This is an oppor-
tunity for database administrators and security professionals to work together,
take a leadership role, and move the enter-
prise to take action.
Over the coming year, database manag-
ers and professionals will be called upon
to be more proactive and lead their com-
panies to successfully ensure data privacy,
protect against insider threats, and address
regulatory compliance. An annual survey
by Unisphere Research for the Indepen-
dent Oracle Users Group (IOUG) finds
there is more awareness than ever of the
critical need to lock down data environ-
ments, but also organizational hurdles in
building awareness and budgetary sup-
port for enterprise data security ("DBA
Security Superhero: 2014 IOUG Enterprise
Data Security Survey," October 2014).
6. Mobile Becomes an Equal Client
Mobile computing is on the rise, and
increasingly mobile devices will be the cli-
ent of choice with enterprises in the year
ahead. This means creating ways to access
and work with data over mobile devices. More analytics, for example, is being
supported within mobile apps. Some of
the leading BI and analytics solutions ven-
dors now offer mobile apps that offer dash-
boards, often configurable, that provide
insight and visibility into operational
trends to decision makers who are outside
of their offices. While industry watchers
have been predicting the democratiza-
tion of data analytics across enterprises
for years, the arrival of mobile apps as front
end clients to BI and analytics systems may
be the ultimate gateway to easy-to-use
analytics across the enterprise. By their
very nature, mobile apps need to be designed to be as simple and easy to use
as possible. Over the coming year, mobile
app access to key data-driven applications
will become part of every enterprise.
The ability to access data from any and
all devices, of course, will increase secu-
rity concerns. While many enterprises
have tacitly approved the "bring your own
device" (BYOD) trend in recent years,
some are looking to move to corporate-
issued devices that will help lock down
sensitive data. The coming year will see increased efforts to better ensure the secu-
rity of data being sent to mobile devices.
7. Storage Enters the Limelight
Storage has always been an unappreci-
ated field of endeavor. It has been almost
an afterthought, seen in disk drives and
disk arrays running somewhere in the
back of data centers. This is changing rap-
idly, as enterprises recognize that storage is
shaping their infrastructures' capabilities.
There's no question that many organiza-
tions are dealing with rapidly expand-
ing data stores. Much of today's data
growth, coming out of enterprise appli-
cations, is being exacerbated by greater
volumes of unstructured, social media and
machine-generated data making their way
into the business analytics platform. Many
enterprises are also evolving their data
assets into data lakes, in which enter-
prise data is stored up front in its raw form
and accessed when needed, versus being
loaded into purpose-built, siloed data
environments. The question becomes, then, where
and how to store all this data. The storage
approach that has worked well for orga-
nizations over the decades (produce data
within a transaction system, then send it
downstream to a disk, and ultimately, a
tape system) is being overwhelmed by
today's data demands. Not only is the
amount of data rapidly growing, but
more users are demanding greater and
more immediate access to data, even
when it may be several weeks, months, or
years old.
Over the coming year, there will be a
push by enterprises to manage storage smartly, versus simply adding more
disk capacity to existing systems or pur-
chasing new systems from year to year. A
recent survey by Unisphere Research finds
growing impetus toward smarter storage
solutions, which include increased stor-
age efficiency through data compression,
information lifecycle management and
consolidation, or deployment strategies
such as tiered storage. At the same time,
storage expenditures keep rising, eating a
significant share of IT budgets and impeding other IT initiatives. For those with
significant storage issues, the share stor-
age takes out of IT budgets is even greater
("Managing Exploding Data Growth in
the Enterprise: 2014 IOUG Database Stor-
age Survey," May 2014).
What's Ahead
The year 2015 represents new oppor-
tunities to expand and enlighten data
management practices and platforms to
meet the needs of the ever-expanding
digital enterprise. To be successful, dig-
ital business efforts need to have solid
data management practices underneath.
As enterprises go digital, they will be rely-
ing on well-managed and diverse data to
explore and reach new markets.
Joe McKendrick is an
author and independent
researcher covering innovation, information tech-
nology trends, and markets.
Much of his research work
is in conjunction with Uni-
sphere Research, a division of Information
Today, Inc. (ITI), for user groups including
SHARE, the Oracle Applications Users Group,
the Independent Oracle Users Group, and the
International DB2 Users Group. He is also
a regular contributor to Database Trends
and Applications, published by ITI.
industry updates
The State of Data Integration
Data Integration Evolves to Support a Bigger Analytic Vision
By Stephen Swoyer
What has always made data a
hard problem is precisely the issue of access-
ing, preparing, and producing it for machine
and, ultimately, for human consumption.
What makes this a much harder problem in
the age of big data is that the information
we're consuming is vectored to us from so
many different directions.
The data integration (DI) status quo is
predicated on a model of data-at-rest. The
designated final destination for data-at-rest is
(and, at least for the foreseeable future, will
remain) the data warehouse (DW). Tradi-
tionally, data of a certain type was vectored
to the DW from more or less predictable
directions (viz., OLTP systems or flat files)
and at the more or less predictable velocities
circumscribed by the limitations of the batch
model. Thanks to big data, this is no longer the case. Granted, the term "big data" is
empty, hyperbolic, and insufficient; granted,
there's at least as much big data hype as big
data substance. But still, as a phenomenon,
big data at once describes 1) the technolog-
ical capacity to ingest, store, manage, syn-
thesize, and make use of information to an
unprecedented degree and 2) the cultural
capacity to imaginatively conceive of and
meaningfully interact with information in
fundamentally different ways. One conse-
quence of this has been the emergence of a
new DI model that doesn't so much aim to
supplant as to enrich the status quo ante. In
addition to data-at-rest, the new DI model is
able to accommodate data-in-motion, i.e.,
data as it streams and data as it pulses: from
the logs or events generated by sensors or
other periodic signalers to the signatures or
anomalies that are concomitant with aperi-
odic events such as fraud, impending failure,
or service disruption.
Needless to say, comparatively little of
this information is vectoring in from con-
ventional OLTP systems. And that, as poet
Robert Frost might put it, makes all the
difference.
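To make the contrast concrete, here is a small Python sketch of the two models; the event format, window size, and threshold are assumptions for illustration. Data-at-rest is queried after it lands, while data-in-motion is evaluated as each event arrives, so that aperiodic signatures such as an error spike surface immediately.

```python
from collections import deque

def at_rest_report(events):
    """Data-at-rest: a batch pass over a landed data set (e.g., a nightly report)."""
    failures = sum(1 for e in events if e["status"] == "fail")
    return failures, len(events)

def in_motion_monitor(stream, window=100, threshold=0.2):
    """Data-in-motion: score each event as it arrives and flag anomalous windows."""
    recent = deque(maxlen=window)
    for event in stream:
        recent.append(1 if event["status"] == "fail" else 0)
        if len(recent) == window and sum(recent) / window > threshold:
            yield event  # e.g., the signature of impending failure or fraud

# events = read_from_sensor_feed()   # hypothetical source of periodic signals
```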
Beyond Description
We're used to thinking of data in terms of the predicates we attach to it. Now as ever,
we want and need to access, integrate, and
deliver data from traditional structured
sources such as OLTP DBMSs, or flat and/
or CSV files. Increasingly, however, we're
alert to, or we're intrigued by, the value of the
information that we believe to be locked into
multi-structured or so-called "unstruc-
tured" data, too. (Examples of the former
include log files and event messages; the lat-
ter is usually used as a kitchen-sink category
to encompass virtually any data-type.) Even
if we put aside the philosophical problem
of structure as such (semantics is structure;
schema is structure; a file-type is structure),
we're confronted with the fact that data inte-
gration practices and methods must and
will differ for each of these different types.
The kinds of operations and transforma-
tions we use to prepare and restructure the
normalized data we extract from OLTP sys-
tems for business intelligence (BI) reporting
and analysis will prove to be insufficient (or
quite simply inapposite) when brought to
bear against these different types of data.
The problem of accessing, preparing, and
delivering unconventional types of data from
unconventional types of sources, as well as
of making this data available to a new class
of unconventional consumers, requires new methods and practices, to say nothing of new
(or at least complementary) tools.
This has everything to do with what might
be called a much bigger analytic vision.
Inspired by the promise of exploiting data
mining, predictive analytics, machine learn-
ing, or other types of advanced analytics on a
massive scale, the focus of DI is shifting from
that of a static, deterministic discipline, in
which a kind of two-dimensional world is
represented in a finite number of well-defined
sponsored content
Traditionally, organizations have relied upon a single data warehouse to serve as the center of their data universe. This data warehouse approach operated on a paradigm in which the data revealed a single, unified version of the truth. But today, both the amount and types of data available have increased dramatically. With the advent of Big Data, companies now have access to more business-relevant information than ever before, resulting in many data repositories to store and analyze it.
THE CHALLENGES OF MOVING BIG DATA
However, to use Big Data, you must
be able to move it, and the challenges of
moving Big Data are multi-faceted. Out of
the gate, the pipes between data repositories
remain the same size, while the data grows
at an exponential rate. The issue worsens
when traditional tools are used to attempt to
access, process, and integrate this data with other systems. Yet, companies cannot rely on
traditional data warehouses alone.
Thus, companies are increasingly turning to Apache Hadoop, the free, open source, scalable software for distributed computing that handles both structured and unstructured data. The movement towards Hadoop is indicative of something bigger: a new paradigm that's taking over the business world, that of the modern data architecture and the data supply chain that feeds it. The data supply chain describes a new reality in which businesses find themselves coordinating multiple data sources rather than using a single data warehouse. The data from these sources, which often varies in content, structure, and type, has to be integrated with data from other departments and other target systems within an enterprise. Big Data is rarely used en masse. Instead, different types of data tell different stories, and companies need to be able to integrate all of these narratives to inform business decisions.
HADOOP'S ROLE IN THE DATA SUPPLY CHAIN
In this new world, companies must constantly move data from one place to another to ensure efficiency and lower costs. Hadoop plays a significant role in the data supply chain. However, it's not an end-all solution. The standard Hadoop toolsets lack several critical capabilities, including the ability to move data between Hadoop and relational databases. The technologies that exist for data movement across Hadoop are cumbersome. Companies need solutions that make data movement to and from Hadoop easier, faster, and more cost effective.

While open source tools like Sqoop are designed to deal with large amounts of data, they are often not enough by themselves. These tools can be difficult to use, require specialized skills and time to implement, typically focus only on certain types of data, and cannot support incremental changes or real-time feeds.
EFFECTIVELY MOVING BIG DATA INTO AND OUT OF HADOOP
The most effective answer to this challenge is to implement solutions that are specifically designed to ease and accelerate the process of data movement across a broad number of platforms. These technologies allow IT organizations to easily move data from one repository to another in a highly visible manner. The software should also unify and integrate data from all platforms within an enterprise, not just Hadoop. And it should include change data capture (CDC) technology to keep the target data up to date in a way that's sensitive to network bandwidth.
Attunity offers a solution for companies looking to turbocharge the flows across their data supply chain while fully supporting a modern data architecture. Attunity Replicate features a user-friendly GUI, with a Click-to-Replicate design and drag-and-drop functionality to move data between repositories. Attunity supports Hadoop as a source and as a target, as well as every major commercial database platform and data warehouse available. It is scalable and manageable and can be used to move data to and from the cloud when combined with Attunity CloudBeam.
MAKING BIG DATA & HADOOP WORK FOR YOU!

Attunity enables companies to improve their data flows to capitalize on all their data, including Big Data sources. Their solutions limit the investment a company needs to make by reducing the hardware and software needed for managing and moving data across multiple platforms out of the box. Additionally, Attunity solutions are high performance and provide an easy-to-use graphical interface that helps companies make timely and fully informed decisions. Using high-performance data movement software like Attunity's, companies can unleash not only the full power of Hadoop but also the power of all their other technologies to enable real-time analytics and true competitive advantage.
To learn more, download this Attunity whitepaper: Hadoop and the Modern Data Supply Chain
http://bit.ly/HadoopWP
ATTUNITY
www.attunity.com
Unleashing the Value of Big Data & Hadoop
The State of Data Integration
dimensions) to a polygonal or probabilistic discipline with a much greater number of dimensions. The static stuff will still matter and will continue to power the great bulk of day-to-day decision making, but this will in turn be enriched, episodically, with different types of data. The challenge for DI is to accommodate and promote this enrichment, even as budgets hold steady (or are adjusted only marginally) and resources remain constrained.
Automatic for the People
What does this mean for data integration? For one thing, the day-to-day work of traditional DI will, over time, be simplified, if not actually automated. This work includes activities such as 1) the exploration, identification, and mapping of sources; 2) the creation and maintenance of metadata and documentation; 3) the automation or acceleration, insofar as feasible, of testing and quality assurance; and, crucially, 4) the deployment of new OLTP systems and data warehouses, as well as of BI and analytic applications or artifacts. These activities can and will be accelerated; in some cases (as with the generation and maintenance of metadata or documentation) they will, for practical, day-to-day purposes, be more or less completely automated.
This is in part a function of the maturity of the available tooling. Most DI and RDBMS vendors ship platform-specific automation features (pre-fab source connectivity and transformation wizards; data model design, generation, and conversion tools; SQL, script, and even procedural code generators; scheduling facilities; in some cases even automated dev-testing routines) with their respective tools. Similarly, a passel of smaller, self-styled data warehouse automation vendors market platform-independent tools that purport to automate most of the same kinds of activities, and which are also optimized for multiple target platforms. On top of this, data virtualization (DV) and on-premises-to-cloud integration specialists can bring intriguing technologies to bear, too. Most DI vendors offer DV (or data federation) capabilities of some kind; others market DV-only products. None of these tools is in any sense a silver bullet: custom-fitting and design of some kind is still required and, frankly, always will be required. The catch, of course, is that even though such tools can likewise help to accelerate key aspects of the day-to-day work of building, managing, optimizing, maintaining, or upgrading OLTP and BI/decision support systems, they can't and won't replace human creativity and ingenuity. The important thing is that they give us the capacity to substantively accelerate much of the heavy lifting of data integration work.
Big Data Integration: Still a Relatively New Frontier
This just isn't the case in the big data world. As Douglas Adams, author of The Hitchhiker's Guide to the Galaxy, might put it, traditional data integration tools or services are mature and robust in exactly the way that big data DI tools aren't.

At this point, guided and/or self-service features (to say nothing of management-automation amenities) are still mostly missing from the big data offerings. As a result, organizations will need more developers and more technologists to do more hands-on work when they're doing data integration in conjunction with big data platforms.
Industry luminary Richard Winter tackled this issue in a report entitled "The Real Cost of Big Data," which highlights the cost disparity between using Hadoop as a landing area and/or persistent store for data versus using it as a platform for business intelligence (BI) and decision support workloads. As a platform for data ingestion, persistence, and preparation, the research suggests, Hadoop is orders of magnitude cheaper than a conventional OLTP or DW system. Conversely, the cost of using Hadoop as a primary platform for BI and analytic workloads is orders of magnitude more expensive.
An issue that tends to get glossed over is that of Hadoop's efficacy as a data management platform. Managing data isn't simply a question of ingesting and storing it; it's likewise, and to a much greater extent, a question of retrieving just the right data, of preparing it in just the right format, and of delivering it at more or less the right time. In other words, big data tools aren't only less productive than those of traditional BI and decision support; big data management platforms are themselves comparatively immature, too. Generally speaking, they lack support for key database features or for core transaction-processing concepts, such as ACID integrity. The simple reason for this is that many platforms either aren't databases or eschew conventional DBMS reliability and concurrency features to address scaling-specific or application-specific requirements. The upshot, then, is that the human focus of data integration is shifting and will continue to shift to Hadoop and other big data platforms, not least because these platforms tend to require considerable human oversight and intervention.
This doesn't mean that data, applications, and other resources are shifting or will shift to big data platforms, never to return or to be recirculated. For one thing, there's cloud, which is having no less profound an impact on data integration and data management. Data must be vectored from big data platforms (in the cloud or on-premises) to other big data
platforms (in the cloud or on-premises), to the cloud in general (i.e., to SaaS, platform-as-a-service [PaaS], and infrastructure-as-a-service [IaaS] resources) and, last but not least, to good old on-premises resources like applications and databases.

There's no shortage of data exchange formats for integrating data in this context (JSON and XML foremost among them), but the venerable SQL language will continue to be an important and even a preferred mechanism for data integration in on-premises, big data, and even
cloud environments. The reasons for this are many. First, SQL is an extremely efficient and productive language: According to a tally compiled by Andrew Binstock, editor-in-chief of Dr. Dobb's Journal, SQL trails only legacy languages such as .ASP and Visual Basic (at numbers 1 and 2, respectively) and Java (at number 3) productivity-wise. (Binstock based his tally on data sourced from the International Software Benchmarking Standards Group, or ISBSG, which maintains a database of more than 6,000 software projects.) Second, there's a surfeit of available SQL query interfaces and/or adapters, along with (to a lesser extent) SQL-savvy coders. Third, open source software (OSS) and proprietary vendors have expended a simply shocking amount of effort to develop ANSI-SQL-on-Hadoop technologies. This is a very good thing, chiefly because SQL is arguably the single most promising tool for getting the right data in the right format out of Hadoop.
Two years ago, for example, the most efficient ways to get data out of Hadoop included:
1. Writing MapReduce jobs in Java in order to translate the simple dependency, linear chain, or directed acyclic graph (DAG) operations involved in data engineering into map and reduce operations;
2. Writing jobs in Pig Latin for Hadoop's Pig framework to achieve basically the same thing;
3. Writing SQL-like queries in Hive Query Language (HiveQL) to achieve basically the same thing (see the sketch after this list); or
4. Exploiting bleeding-edge technologies (such as Cascading, an API layered on top of Hadoop that's supposed to make it easier to program and manage) to achieve basically the same thing.
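To make the HiveQL option concrete, here is a minimal sketch in Python using the PyHive client. It assumes a HiveServer2 instance listening on localhost:10000 and a hypothetical page_views table; it is meant only to illustrate how a simple aggregation that would once have required a hand-written MapReduce job can be expressed as a single query.

    from pyhive import hive  # third-party Python client for HiveServer2

    # Connect to a (hypothetical) HiveServer2 endpoint.
    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()

    # One HiveQL statement replaces a hand-coded map/reduce pair:
    # group page views by URL and count them.
    cursor.execute(
        "SELECT url, COUNT(*) AS views "
        "FROM page_views "
        "GROUP BY url "
        "ORDER BY views DESC "
        "LIMIT 10"
    )

    for url, views in cursor.fetchall():
        print(url, views)

    cursor.close()
    conn.close()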
Today, there's no shortage of mechanisms to get data from Hadoop. Take Hive, an interpreter that compiles HiveQL queries into MapReduce jobs. As of Hadoop 2.x, Hive can leverage either Hadoop's MapReduce engine or the new Apache Tez framework. Tez is just one of several designs that exploit Hadoop's new resource manager, YARN, which makes it easier to manage and allocate resources for multiple compute engines, in addition to MapReduce. Thus, Apache Tez, which is optimized for the operations (such as DAGs) that are characteristic of data transformation workloads, now offers features such as pipelining and interactivity for ETL-on-Hadoop. There's also Apache Spark, a cluster computing framework that can run in the context of Hadoop. It's touted as a high-performance complement and/or alternative to Hadoop's built-in MapReduce compute engine; as of version 1.0.0, Spark is paired with Spark SQL, a new, comparatively immature SQL interpreter. (Spark SQL replaces a predecessor project, dubbed Shark, which was conceived as a Hive-oriented SQL interpreter.) Over the last year, especially, Spark has become one of the most hyped of Hadoop-oriented technologies; many DI or analytic vendors now support Spark to one degree or another in their products. Generally speaking, most vendors now offer SQL-on-Hadoop options of one kind or another, while others also offer native (optimized) ETL-on-Hadoop offerings.
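As a rough illustration of what ETL-on-Hadoop with Spark SQL can look like, the following Python (PySpark) sketch uses the SQLContext entry point from the Spark 1.x line discussed above. The HDFS paths and the events dataset are hypothetical, and exact method names vary across Spark releases.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="etl-on-hadoop-sketch")
    sqlContext = SQLContext(sc)

    # Load semi-structured event data from HDFS; Spark SQL infers a schema.
    events = sqlContext.jsonFile("hdfs:///raw/events/2014/11/")
    events.registerTempTable("events")

    # A SQL transformation in place of a hand-written MapReduce job:
    # keep only purchase events and aggregate revenue per customer.
    purchases = sqlContext.sql("""
        SELECT customer_id, SUM(amount) AS revenue
        FROM events
        WHERE event_type = 'purchase'
        GROUP BY customer_id
    """)

    # Persist the prepared result back to HDFS in a columnar format.
    purchases.saveAsParquetFile("hdfs:///prepared/revenue_by_customer/")

    sc.stop()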
What's Ahead

Cloud is a critical context for data integration. One reason for this is that most providers offer export facilities or publish APIs that facilitate access to cloud data. Another reason, as I wrote last year, is that doing DI in the cloud doesn't invalidate (completely or, even, in large part) existing best practices: If you want to run advanced analytics on SaaS data, you've either got to load it into an existing, on-premises repository or, alternatively, expose it to a cloud analytics provider. What you do in the former scenario winds up looking a lot like what you do with traditional DI. And the good news is that you can do a lot more with traditional DI tools or platforms than used to be the case. Most data integration offerings can parse, shred, and transform the JSON and XML used for data exchange; some can do the same with formats such as RDF, YAML, or Atom. Several prominent database providers offer support for in-database JSON (e.g., parsing and shredding JSON documents via a name-value-pair function or landing and storing them intact as variable character text), while others offer some kind of support for in-database storage (and querying) of JSON data. DV vendors are typically no less accomplished than the big DI platforms with respect to their capacity to accommodate a wide variety of data exchange formats, from JSON/XML to flat files.
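The "shredding" described above, turning a nested JSON document into flat name-value pairs that can be loaded into relational or columnar targets, can be approximated in a few lines of Python. This is a simplified, generic sketch rather than any particular vendor's in-database function.

    import json

    def shred(document, prefix=""):
        """Flatten nested JSON into dotted name-value pairs."""
        pairs = {}
        if isinstance(document, dict):
            for key, value in document.items():
                pairs.update(shred(value, prefix + key + "."))
        elif isinstance(document, list):
            for index, value in enumerate(document):
                pairs.update(shred(value, prefix + str(index) + "."))
        else:
            pairs[prefix.rstrip(".")] = document
        return pairs

    raw = '{"order": {"id": 42, "items": [{"sku": "A1", "qty": 2}]}}'
    print(shred(json.loads(raw)))
    # {'order.id': 42, 'order.items.0.sku': 'A1', 'order.items.0.qty': 2}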
Any account of data integration and big data is bound to be insufficient simply because there is so much happening. As noted, the Hadoop platform is by no means the only (nor, for that matter, the most exciting) game in town. Apache Spark, which (a) runs in the context of Hadoop and which (b) can both persist data (to HDFS, the Hadoop Distributed File System) and run in-memory (using Tachyon), last year emerged as a bona fide big data superstar. Spark is touted as a compelling platform for both analytics and data integration. Several DI vendors already claim to support it to some extent. Spark, like almost everything else in the space, will bear watching. And so it goes.
Stephen Swoyer is a technology writer with more than 16 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost a decade. He's particularly intrigued by the thorny people and process problems most BI and DW vendors almost never want to acknowledge, let alone talk about. You can contact him at
Increasingly, firms are splitting up their
analytical teams into a model development
and a model validation team.
it is important to meticulously list all data within the enterprise that could potentially be beneficial to the analytical exercise. "The more data, the better" is the rule here. Analytical models have sophisticated built-in facilities to automatically decide which data elements are important for the task at hand and which ones can be left out of further analysis. The best way to improve the performance of any analytical model is by investing in data. This can be done by working on both quantity and quality simultaneously. Regarding the former, a key challenge concerns the aggregation of structured (e.g., stored in relational databases) and unstructured (e.g., textual) data to provide a comprehensive and holistic view of customer behavior. Closely related to this is the integration of offline and online data, an issue that many companies are struggling with nowadays. Furthermore, companies can also look beyond their internal boundaries and consider the purchase of external data from data poolers to complement their internal analytical models. Extensive research has indicated that this is very beneficial for both perfecting and benchmarking the analytical models developed.
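To illustrate the kind of built-in variable selection mentioned above, here is a small Python sketch using scikit-learn on synthetic data; the dataset and the cutoff of ten retained variables are arbitrary choices made for the example, not a recommendation from the article.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data: 1,000 customers, 25 candidate variables,
    # only 5 of which actually carry signal.
    X, y = make_classification(n_samples=1000, n_features=25,
                               n_informative=5, random_state=42)

    # Fit a model whose importance scores indicate which
    # data elements matter for the task at hand.
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X, y)

    # Rank variables by importance and keep the strongest ones.
    ranked = sorted(enumerate(model.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    keep = [index for index, score in ranked[:10]]
    print("Variables retained for modeling:", keep)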
Although data is typically available in large quantities, its quality is often a more painful concern. Here the GIGO principle applies: garbage in, garbage out, or bad data yields bad models. This may sound obvious at first. However, good data quality is often the Achilles' heel of many analytical projects. Data quality can be evaluated along various dimensions, such as data accuracy, data completeness, data timeliness, and data consistency, to name a few. To be successful in big data and analytics, companies need to continuously monitor and remedy data quality problems by setting up master data management programs and creating new job roles such as that of data auditor, data steward, or data quality manager.
Analytics should always start from a business problem rather than from a specific technological solution. However, this comes with a chicken-and-egg problem: To identify new business opportunities, one needs to be aware of the technological potential first. As an example, think about the area of social media analytics. By first understanding how this technology works, a firm can start thinking about how to leverage it to study its online brand perception or perform trend monitoring. To bridge the gap between technology and the business, continuous education is important. It allows companies to stay ahead of the competition and spearhead the development of new analytical applications. At this point, the academic world should make a mea culpa, since the offering of Master of Science programs in the area of big data and analytics is currently falling short of demand.
Another important component of turning data into concrete business insights and adding value using analytics concerns the proper validation of the analytical models built. Quotes such as "if you torture the data long enough, it will confess" and terms such as "data massage" have cast a negative light on the field of analytics. It speaks for itself that analytical models should be properly audited and validated, and many mechanisms, procedures, and tools are available to do this. That's why more and more firms are splitting up their analytical teams into a model development team and a model validation team. Good corporate governance then dictates the construction of a Chinese wall between the two teams, such that models developed by the former can be objectively and independently evaluated by the latter. One might even contemplate having the validation performed by an external partner. By setting up an analytical infrastructure whereby models are critically evaluated and validated on an ongoing basis, a firm is capable of continuously improving its analytical models and thus can better target its customers.
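One common building block of the independent validation described here is evaluating a model on data the development team never used for fitting. The following Python sketch, again using scikit-learn on synthetic data, shows that basic pattern; the split ratio and the choice of model are illustrative assumptions only.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # The development team fits on one partition...
    X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

    # ...while the validation team scores the model on data it never saw.
    scores = model.predict_proba(X_val)[:, 1]
    print("Out-of-sample AUC:", round(roc_auc_score(y_val, scores), 3))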
Analytics is not a one-shot exercise. In fact, the frustrating thing is that once an analytical model has been built and put into production, it is already outdated. Analytical models constantly lag behind reality, but the gap should be kept as small as possible. Just think about it: An analytical model is built using a sample of data gathered at a specific snapshot in time, given a specific internal and external environment. However, these environments are not static but change continuously because of both internal effects (new strategies, changing customer behavior) and external effects (new economic conditions, new regulations). Think about a fraud detection model whereby
crimina