Big Data Sourcebook Second Edition


WWW.DBTA.COM
From the publishers of Database Trends and Applications


CONTENTS
BIG DATA SOURCEBOOK, DECEMBER 2014

introduction
2   The Big Data Frontier
    Joyce Wells

industry updates
4   How Businesses Are Driving Big Data Transformation
    John O'Brien
10  The Enabling Force Behind Digital Enterprises
    Joe McKendrick
14  Data Integration Evolves to Support a Bigger Analytic Vision
    Stephen Swoyer
18  Turning Data Into Value Using Analytics
    Bart Baesens
22  As Clouds Roll In, Expectations for Performance and Availability Billow
    Michael Corey, Don Sullivan
26  Social Media Analytics Tools and Platforms: The Need for Speed
    Peter J. Auditore
30  The Big Data Challenge to Data Quality
    Elliot King
36  Building the Unstructured Big Data/Data Warehouse Interface
    W. H. Inmon
40  Big Data Poses Security Risks
    Geoff Keston

BIG DATA SOURCEBOOK is published annually by Information Today, Inc., 143 Old Marlton Pike, Medford, NJ 08055.

POSTMASTER: Send all address changes to: Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055. Copyright 2014, Information Today, Inc. All rights reserved. PRINTED IN THE UNITED STATES OF AMERICA.

The Big Data Sourcebook is a resource for IT managers and professionals providing information on the enterprise and technology issues surrounding the big data phenomenon and the need to better manage and extract value from large quantities of structured, unstructured, and semi-structured data. The Big Data Sourcebook provides in-depth articles on the expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as new capabilities for traditional data management systems. Articles cover business- and technology-related topics, including business intelligence and advanced analytics, data security and governance, data integration, data quality and master data management, social media analytics, and data warehousing.

No part of this magazine may be reproduced by any means (print, electronic, or any other) without written permission of the publisher.

COPYRIGHT INFORMATION
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC's Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS.

Creation of derivative works, such as informative abstracts, unless agreed to in writing by the copyright owner, is forbidden.

Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook. Big Data Sourcebook disclaims responsibility for the statements, either of fact or opinion, advanced by the contributors and/or authors.

The views in this publication are those of the authors and do not necessarily reflect the views of Information Today, Inc. (ITI) or the editors.

© 2014 Information Today, Inc.

From the publishers of Database Trends and Applications

PUBLISHED BY Unisphere Media, a Division of Information Today, Inc.
EDITORIAL & SALES OFFICE: 630 Central Avenue, Murray Hill, New Providence, NJ 07974
CORPORATE HEADQUARTERS: 143 Old Marlton Pike, Medford, NJ 08055

Thomas Hogan Jr., Group Publisher, 609-654-6266; thoganjr@infotoday
Joyce Wells, Managing Editor, 908-795-3704; [email protected]
Joseph McKendrick, Contributing Editor; [email protected]
Alexis Sopko, Advertising Coordinator, 908-795-3703; [email protected]
Adam Shepherd, Editorial and Advertising Assistant, 908-795-3705
Celeste Peterson-Sloss, Lauree Padgett, Alison A. Trotta, Editorial Services
Norma Neimeister, Production Manager
Denise M. Erickson, Senior Graphic Designer
Jackie Crawford, Ad Trafficking Coordinator
Sheila Willison, Marketing Manager, Events and Circulation, 859-278-2223; [email protected]
DawnEl Harris, Director of Web Events; [email protected]

ADVERTISING
Stephen Faig, Business Development Manager, 908-795-3702; [email protected]

INFORMATION TODAY, INC. EXECUTIVE MANAGEMENT
Thomas H. Hogan, President and CEO
Roger R. Bilboul, Chairman of the Board
John C. Yersak, Vice President and CAO
Thomas Hogan Jr., Vice President, Marketing and Business Development
Richard T. Kaser, Vice President, Content
Bill Spence, Vice President, Information Technology


The Big Data Frontier
By Joyce Wells

The rise of cloud, mobility, and the proliferation of connected devices, coupled with newer data management approaches, such as Hadoop, NoSQL, and in-memory systems, are increasing the opportunities for enterprises to harness data. However, with this new frontier there are challenges to be overcome. As they work to maintain legacy applications and systems, IT organizations must address new demands for more timely access to more data from more users, in addition to maintaining continuous availability of IT systems and enforcing appropriate data governance.

It's a lot to think about. How can companies choose the right approach to leverage big data while keeping newer technologies in line with budgetary, application availability, and security concerns?

Over the past year, Unisphere Research, a division of Information Today, Inc., has conducted surveys among IT professionals to gain insight into the challenges organizations are facing.

The information overload is already taking its toll on IT organizations and professionals. According to a Unisphere Research report, "Governance Moves Big Data From Hype to Confidence," the percentage of organizations with big data projects is expected to triple by the end of 2015. However, while organizations are investing in increasing the information at their disposal, they are finding that they are committing more time to simply locating the necessary data, as opposed to actually analyzing it. In addition, the report, based on a survey of 304 data management professionals and sponsored by IBM, found that respondents tend to be less confident about data gathered through social media and public cloud applications.

With all this data, there are also concerns about the ability to maintain the high availability mandated by today's stringent service level agreements. According to another Unisphere Research survey, sponsored by EMC and conducted among 315 members of the Independent Oracle Users Group (IOUG), close to one-fourth of respondents' organizations have SLAs of four nines of availability or greater, meaning that they can have only 52 minutes or less of downtime a year. The survey, "Bringing Continuous Availability to Oracle Environments," found that more than 25% of respondents dealt with more than 8 hours of unplanned downtime during the previous year, which they attributed to network outages, server failures, storage failures, human error, and power outages.

As data management and access becomes more critical to business success, Unisphere Research finds that IT professionals are embracing their expanded roles and relish the opportunity to work with new technologies. Increasingly, they want to be at the center of the action, and are assuming roles associated with data science, but too often they see themselves being forced into the job of firefighting rather than strategic, high-value tasks. The benefits of ongoing staff training and use of cloud and database automation are some of the approaches cited in the report, "The Vanishing Database Administrator," sponsored by Ntirety, a division of HOSTING.

Indeed, the increasing size and complexity of database environments is stretching IT resources thin, causing organizations to seek ways to automate routine tasks to free up assets, such as tapping into virtualization and cloud. According to "The Empowered Database," a report based on a survey of 338 IOUG members and sponsored by VMware and EMC, nearly one-third of organizations are using or considering a public cloud service, and almost half are currently using or considering a private cloud.

Still, we are just at the beginning of the changes to come as a result of big data. In a recent Unisphere Research Quick Poll, close to one-third of enterprises, or 30%, report they have deployed the Apache Hadoop framework in some capacity, while another 26% said they planned to adopt Hadoop within the next year. Strikingly, 91% of respondents at Hadoop sites will be increasing their use of Hadoop over the next 3 years, and one-third describe expansion plans as significant. Key functions or applications supported by Hadoop projects include analytics and business intelligence, working with IT operational data, and supporting special projects.

To help shed light on the expanding territory of big data, DBTA presents the second annual Big Data Sourcebook, a guide to the key enterprise and technology matters IT professionals are grappling with as they take the journey to becoming data-driven enterprises. In addition to articles penned by subject matter experts, leading vendors also showcase their products and approaches to gaining value from big data projects. Together, this combination of articles and sponsored content provides insight into the current big data issues and opportunities.



    sponsored content

Operational Big Data

Operational data is, in fact, a source of big data. Today, operational databases must meet the challenges of variety, velocity, and volume, with millions of users and billions of machines reading and writing data via enterprise, mobile, and web applications. The data is stored in an operational database before it's stored in an Apache Hadoop distribution.

It's audits, clickstreams, customer information, financial investments and payments, inventory and parts, locations, logs, messages, patient records, plays and scores, sensor readings, scientific data, social interactions, user and process status, user and visitor profiles, and more. It drives the eCommerce, energy, entertainment, finance, gaming, healthcare, insurance, retail, social media, and telecommunications industries, and more.

Today, operational databases must read and write billions of values, maintain low latency, and sustain high throughput to meet the challenges of velocity and volume. They must sustain millions of operations per second, maintain sub-millisecond latency, and store billions of documents and terabytes of data. They must be able to support the evolution of data in the form of new attributes and new types.

The ability to meet these challenges is necessary to support an agile enterprise. By doing so, the agile enterprise extracts actionable intelligence. However, time is of the essence. When a new type of data emerges, operational databases must store it without delay. When the number of users and machines increases, the operational database must continue to provide data access without performance degradation. When the size of the data set increases, the operational database must continue to store data.

These challenges are met by a) supporting a flexible data model and b) scaling out on commodity hardware. They are met by NoSQL databases. They are met by Couchbase Server. It's a scalable, high-performance document database engineered for reliability and availability. By supporting a document model via JSON, it can store new attributes and new types of data without modification, index the data, and enable near-real-time, lightweight analytics. By implementing a shared-nothing architecture with no single point of failure and consistent hashing, it can scale with ease, on demand, and without affecting applications. By integrating a managed object cache and asynchronous persistence, it can maintain sub-millisecond response times and sustain high throughput. Couchbase Server was engineered for operational big data and its requirements.
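As a minimal sketch of what the flexible, JSON-based document model described above allows, the snippet below shows a new attribute and a nested field appearing in later documents without any schema migration. The document shapes and field names are illustrative only, not taken from Couchbase's documentation, and the client SDK call is deliberately omitted.

```python
import json

# An existing sensor-reading document (illustrative shape, not a Couchbase-defined schema).
reading_v1 = {
    "type": "sensor_reading",
    "device_id": "thermostat-17",
    "temperature_c": 21.4,
    "recorded_at": "2014-11-02T08:15:00Z",
}

# A later reading adds new attributes (humidity, firmware) and a nested location.
# With a JSON document model there is no ALTER TABLE step; old and new documents coexist.
reading_v2 = dict(reading_v1,
                  humidity_pct=48.0,
                  firmware="2.1.0",
                  location={"site": "plant-3", "floor": 2})

# Application code tolerates both shapes by treating the newer attributes as optional.
for doc in (reading_v1, reading_v2):
    humidity = doc.get("humidity_pct")  # None for older documents
    print(json.dumps(doc), "humidity:", humidity)
```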

While operational databases provide real-time data access and lightweight analytics, they must integrate with Apache Hadoop distributions for predictive analytics, machine learning, and more. While operational data feeds big data analytics, big data analytics feed operational data. The result is continuous refinement: by analyzing the operational data, it can be updated to improve operational efficiency. The result is a big data feedback loop.

Couchbase provides and supports a Couchbase Server plugin for Apache Sqoop to stream data to and from Apache Hadoop distributions. In fact, Cloudera certified it for Cloudera Enterprise 5. In addition, Couchbase provides and supports a Couchbase Server plugin for Elasticsearch to enable full text search over operational big data.

Finally, operational databases must meet the requirements of a global economy in the information age. Today, users and machines read and write data to enterprise, mobile, and web applications from multiple countries and regions. To maintain data locality, operational databases must support deployment to multiple data centers. To maintain the highest level of data locality, operational databases must extend to mobile phones, tablets, and connected devices.

Couchbase Server supports both unidirectional and bidirectional cross data center replication. It enables the agile enterprise to deploy an operational database to multiple data centers in multiple regions and in multiple countries. It moves the operational database closer to users and machines. In addition, Couchbase Server can extend to mobile phones, tablets, and connected devices with Couchbase Mobile. The platform includes Couchbase Lite, a native document database for iOS, Android, Java/Linux, and .NET, and Couchbase Sync Gateway, which synchronizes data between local databases and remote database servers.

The combination of cross data center replication and mobile synchronization enables the agile enterprise to extend global reach to individual users and machines. If deployed to cloud infrastructure like Amazon Web Services or Microsoft Azure, there is no limit to how far Couchbase Server can scale or how far the agile enterprise can reach.

COUCHBASE
www.couchbase.com

industry updates

How Businesses Are Driving Big Data Transformation
The State of Big Data in 2014
By John O'Brien

In 2014, we continued to watch how big data is enabling all things big about data and its business analytics capabilities. We also saw the emergence (and early acceptance) of Hadoop Version 2 as a data operating platform, with cornerstones of YARN (Yet Another Resource Negotiator) and HDFS (Hadoop Distributed File System). The ecosystem of Apache Foundation projects has continued to mature at a rapid pace, while vendor products continue to join, mature, and benefit from Hadoop improvements.

In last year's Big Data Sourcebook, we highlighted several items in "The State of Big Data" article worth recapping. First, we referenced the battle over persistence for data architectures, primarily in enterprise adoption, which dealt with the "promise of everything in Hadoop" pundits and the "it's OK to have another data platform" perspective. In 2014, we witnessed the acceptance of these multi-tiered, specific workload capability architectures that, at Radiant Advisors, we refer to as the modern data platform. With gaining acceptance, Hadoop is here to stay, and many analysts refer to its role as "inevitable." This, naturally, is tempered with its maturity, the ability for enterprises to find and/or train resources, and specifying the proper first use case project and long-term strategy, such as the data lake or enterprise data hub strategies.

We also discussed how companies needed to understand how "data is data" when approaching big data with big eyes. For the most part, in 2014 we saw mainstream companies shift from a "the sky is falling if I don't start a big data project" mindset to distinguishing big data projects as those for situations where the data wasn't typically relationally structured, or when it had volatile schemas. Schema on read versus schema on write benefits and situations became much better understood in 2014, too. And, more importantly, we have seen an increasing understanding that all data can be valuable, along with the need to explore data for discovery and insights.

Last year, we said that 2014 would be the "race for access" hill, as companies demanded better access to data in Hadoop by business analysts and power users, and that this access no longer be restricted to programmers. As SQL reasserted itself as the de facto standard for common knowledge users and existing data analysis and integration tools, the SQL access capabilities of Hadoop were under incredible pressure to improve in both performance and capability. Continued releases by Hortonworks with Hive/Tez, Cloudera's Impala, and MapR's Drill initiative made orders-of-magnitude performance improvements for SQL access. The race was on: Actian's Vortex made a splash at the Hadoop Summit in June, and others, such as IBM and Pivotal, made significant improvements, too. The race in 2014 continues going into 2015 with more SQL analytic capabilities and performance improvements.
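The schema-on-read idea and the SQL access described above can be illustrated with a small, generic sketch. It uses PySpark's DataFrame and SQL API purely as a stand-in rather than any of the engines named above, and the file path and field names are hypothetical; the point is that the schema is inferred when the raw files are read, not declared when they are written.

```python
from pyspark.sql import SparkSession

# Schema on read: the JSON files in HDFS were written with no declared schema;
# the engine infers one at query time. Path and field names are hypothetical.
spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

events = spark.read.json("hdfs:///data/raw/clickstream/2014/*.json")
events.printSchema()                      # schema discovered from the data itself
events.createOrReplaceTempView("events")  # expose the raw data to SQL users

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM events
    WHERE country = 'US'
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```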

Hadoop 2 Ushers in the Next Generation

The significance of Hadoop 2 has recently started to resonate with companies and enterprise architects. Moving away from its batch-oriented origins, YARN has clearly positioned the data operating system as two separate fundamental architecture components.

While HDFS will continue to evolve as the caretaker of data in the distributed file system architecture, with improved name node high availability and performance, YARN, introduced in Hadoop 2, completely changes the paradigm of data engines and access. Though the primary role of YARN is still that of a resource negotiator for the Hadoop cluster, focused on managing the resource needs of tens of thousands of jobs in the cluster, it has also now established a new framework.

The YARN framework serves as a pluggable layer of YARN-certified engines designed to work with the data in different ways. Previously, MapReduce was the primary programming framework for developers to create applications that leveraged the parallelism of the data nodes. As other projects and data engines could work with HDFS directly without MapReduce, a centralized resource manager was needed that would also enable innovation for new data engines. MapReduce became its own YARN engine for existing Hadoop 1 legacy code, and Hive decoupled to work with the new Tez engine. Long recognized as ahead of the curve, Google caused quite a fury when it announced that MapReduce was dead and that it would no longer develop in it. YARN was positioned for the future of next-generation engines.
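One concrete way to see YARN acting as the cluster-wide resource manager for many engines is to ask its ResourceManager which applications it is currently running, via the ResourceManager's REST API. The sketch below assumes a ResourceManager reachable at rm.example.com on the default web port 8088; both are placeholders for a real cluster.

```python
import requests

# Query the YARN ResourceManager REST API for the applications it is managing.
# The host is a placeholder; 8088 is the usual ResourceManager web port.
RM = "http://rm.example.com:8088"

resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
resp.raise_for_status()

apps = resp.json().get("apps") or {}
for app in apps.get("app", []):
    # Each application reports the execution framework it runs under
    # (e.g., MAPREDUCE, TEZ, SPARK), illustrating YARN's pluggable engines.
    print(app["id"], app["applicationType"], app["name"], app["state"])
```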

Sometimes in 2014 we felt that the booming big data drum was starting to die down. And, sometimes we wondered if it only seemed that way because everyone was chanting "Storm" just a bit louder. Another major driver in Hadoop implementations was that big data didn't mean fast data. The industry wanted both big and fast: The Spark environment is where early adopters were writing new applications, and the development community was quickly developing Spark into a top-level project to meet those needs. The Spark community touts itself as "lightning-fast cluster computing," primarily leveraging the in-memory capabilities of the data nodes, but also offering a newer, faster framework than MapReduce on disk. While Spark was in its infancy in 2013, we saw this need for big data speed being tackled by two-tier distributed in-memory architectures. Today, Spark is a framework for Spark SQL, Spark Streaming, Machine Learning, and GraphX running on Hadoop 2's YARN architecture. In 2014, this has been very exciting for the industry, but many of the mainstream adopters are patiently waiting for the early adopters to do their magic.
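Much of the "lightning-fast," in-memory character attributed to Spark above comes from keeping a working dataset cached in cluster memory across several computations instead of rereading it from disk for each job. A minimal, generic sketch follows; the path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-sketch").getOrCreate()

# Load once, then pin the working set in cluster memory.
orders = spark.read.parquet("hdfs:///data/warehouse/orders")  # hypothetical path
orders.cache()

# Both computations below reuse the cached partitions instead of rereading HDFS,
# which is where most of the speedup over disk-based MapReduce comes from.
daily_revenue = orders.groupBy("order_date").agg(F.sum("amount").alias("revenue"))
top_customers = orders.groupBy("customer_id").count().orderBy(F.desc("count")).limit(10)

daily_revenue.show()
top_customers.show()
```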

Two Camps: Early Adopters and Mainstream Adopters

For years, overwhelming data volumes, complexity, or data science endeavors were the primary drivers behind early big data adopters. Many of these early adopters were in internet-related industries, such as search, e-commerce, social networking, or mobile applications, that were dealing with the explosion of internet usage and adoption.

In 2014, we saw mainstream adopters become the next wave of big data implementations, a wave expected to be multiple times larger than that of the early adopters. We define mainstream adopters as those businesses that seek to modernize their data platforms and analytics capabilities for competitive opportunities and to remain relevant in a fast-changing world, but are tempered with some time to research, analyze, and adopt while maintaining current business operations. Mainstream adopters have had pilots and proofs of concept for the past year or two with one or two Hadoop distributors and now are deciding how this also fits within their overall enterprise data strategy.

Leading the way for mainstream adopters is, by consequence, meeting enterprise and IT requirements for data management, security, data governance, and compliance across a new, more complicated set of data that includes public social data, private customer data, third-party data enrichment, and storage in the cloud and on-premises. Over the past year, it has often felt like the fast-driving big data vehicle hit some pretty thick mud to plow through, and some in the industry argued that forcing Hadoop to meet the requirements of enterprise data management was missing the point of big data and data science. For now, we have seen most companies agree that risk and compliance are things that they must take seriously moving forward.

Mainstream Adopters Redefining Commodity Hardware

As mainstream adopters worked through data management and governance hurdles for enterprise IT, next up was the startling exclamation: "I thought you said that was cheap commodity hardware?!" This has become an interesting reminder of the roots of big data and the difference with IT enterprise-class hardware.

The explanation goes like this. Early developers and adopters were driven to solve truly big data challenges. In the simplest of terms, big data meant big hardware costs and, in order to solve that economic challenge, big data needed to run on the lowest-cost commodity hardware and software that was designed to be fault-tolerant, to cope with high failure rates without disrupting service. This is the purpose of HDFS, though HDFS does not differentiate how a data node is configured, and this is where IT's standard order list differs.

Enterprise infrastructure organizations have been maintaining the data center needs of companies for years and have efficiently standardized orders with chosen vendors. In this definition of commodity servers, it's more about industry standards in parts, so that no proprietary hardware could limit the use of these servers as data nodes (or for any other server needs in the data center). While big data implementations with hundreds to thousands of servers per cluster strive for the lowest-cost white box servers from less recognized industry vendors with the lowest-cost components, and their commodity servers can be as low as $2,000 per server, similar servers from industry-recognized big names, with their own components or industry best-of-breed components touting stringent integration and quality testing, have averaged $25,000 per server in several recent Hadoop implementations that we have been involved with. We have started to coin these servers as "commodity-plus" for mainstream companies operationalizing Hadoop clusters, and they don't seem to mind.

Another discussion that continues from the early adopters is how a data node should be configured. Some implementations concerned with truly big data configure data nodes with 25 front-loading bays and multi-terabyte, slower SATA drives for the highest capacity within their cluster. Other implementations are more concerned with performance and opt for faster SAS drives at lower capacities, balanced with more servers in the cluster for further increased performance from parallelism. Some hyper-performance-oriented clusters will even opt for faster SSD drives in the cluster. This also leads to discussions regarding multi-core CPUs and how much memory should be in a data node. And, there have been equations for the number of cores related to the amount of memory and number of drives for optimal performance of a data node. We have seen that enterprise infrastructure has leaned more toward fewer nodes in a production cluster (8 to 32 data nodes) rather than 100-plus nodes. Their reasoning is twofold: First, more powerful data nodes are actually more interchangeable, with data centers also converging data virtualization and private cloud strategies. Second, ordering more of the powerful servers can yield increased volume discounts and maintain standardization of IT servers in the data center.
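The sizing "equations" mentioned above vary by workload and vendor. The sketch below only illustrates the style of calculation, using made-up rule-of-thumb ratios (roughly one spindle per two cores, a few GB of RAM per core, HDFS replication factor of 3, and some capacity held back for overhead) that are assumptions for illustration, not recommendations.

```python
# Rough data-node and cluster sizing arithmetic. All ratios here are illustrative
# assumptions, not vendor guidance: ~1 disk per 2 cores, ~4 GB RAM per core,
# HDFS replication factor of 3, and ~25% capacity reserved for temp/overhead.
cores_per_node = 16
disks_per_node = cores_per_node // 2          # 8 spindles
disk_size_tb = 3.0
ram_gb = cores_per_node * 4                   # 64 GB
raw_tb_per_node = disks_per_node * disk_size_tb

replication = 3
usable_fraction = 0.75
usable_tb_per_node = raw_tb_per_node * usable_fraction / replication

dataset_tb = 300                              # hypothetical dataset size
nodes_needed = -(-dataset_tb // usable_tb_per_node)   # ceiling division

print(f"Per node: {raw_tb_per_node:.0f} TB raw, {usable_tb_per_node:.1f} TB usable, {ram_gb} GB RAM")
print(f"Nodes needed for {dataset_tb} TB: {nodes_needed:.0f}")
```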

The Data Lake Gains Traction

In 2014, we saw more acceptance of the term "data lake" as an enterprise data architecture concept pushed by Hortonworks and its modern data architecture approach. The enterprise data hub is a similar concept promoted by Cloudera and also has some of the industry mindshare. Informally, we saw the data lake term used most often by companies seeking to understand an approach to enterprise data strategy and roadmaps. However, we also saw backlash from industry pundits that called the data lake a "fallacy" or "murky." Terms such as "data swamp" and "data dump" were also thrown around to describe how things could go wrong without a good strategy and governance in place. Like the term "big data," the data lake has started out as a high-level concept to drive further definition and patterns going forward.

Throughout 2014, we worked with companies ready to define a clear, detailed strategy based on the data lake concept for enterprise data strategy. While this is profound, it is very achievable with data management principles that require answers to new questions regarding a new approach to data architecture. Some issues are simple and more technical, such as keeping online archives of historical data warehouse data still easily accessible by users under revised service-level agreements. Some issues are more fundamental, such as the data lake serving as a single repository of all data, including being a staging area for the enterprise data warehouse (with lower-cost historical persistence for other uses, as data scientists are more interested in raw, unaltered data). Other concerns are a bit more complex, such as persisting customer or other privacy-compliant data in the data lake for analysis purposes. Data governance is concerned with who has access to privacy-controlled data and how it is used. Data management questions the duplication of enterprise data and consistency.

These are hard data management and governance decisions for enterprises to make, but they are making them, and acknowledging that patience and adaptability are key for the coming years as data technologies continue to evolve and change the landscape. The data lake will continue to prove itself and make a fundamental shift in enterprise architecture in the coming years. When you take a step back and watch the business and IT drivers, momentum, and technology development, you can see how the data lake will become an epicenter in enterprise data architecture. If you take two steps back, you will see how 2015 developments could begin the evolution that transforms the data lake into a data operating system for the enterprise, evolving beyond business intelligence and analytics into operational applications and further realization of service-oriented architectures.

What's Ahead

In 2015, mainstream adoption of enterprise data strategies and acceptance of the data lake will continue as data management and governance practices provide further clarity. The cautionary tale of 2014, to ensure that business outcomes drive big data adoption rather than the hype of previous years, will likewise continue. Hadoop is clearly here to stay and "inevitable," and will have its well-deserved seat at the enterprise data table, along with other data technologies. Hadoop won't be taking over the world any time soon, and principle-based frameworks (such as our own modern data platform) recognize the evolution of both data technologies and computing price/performance in modern data architecture. Besides the usual maturing and improvements for existing big data tools, we predict some major achievements in big data for 2015 that we're keeping an eye on.

The Apache Spark engine will continue to mature, improve, and gain acceptance in 2015. With this adoption and the incredible capabilities that it delivers, we could start to see applications and capabilities beyond our imagination. Keep an eye out for these early case studies as inspiration for your own needs.

With deepening acceptance and recognition of YARN as the standard for operating Hadoop clusters, open source projects and existing vendors will port their products to YARN certification and integration. This will not only close the gap between existing data technologies and Hadoop clusters; more exciting will be seeing data technologies port over to YARN so that they can operate and improve their own capabilities within Hadoop. New engines and existing engines running on YARN in 2015 will further influence and drive the adoption of Hadoop in enterprise data architecture.

In 2014, we saw mainstream companies requiring data management features such as security and access control. These first steps will be critical to keep an eye on during 2015 for your own company's data management requirements. Our concern here is that the sexy, high-performance world of Spark and improved SQL capabilities will get the majority of attention, while the less sexy side of security and governance will not mature at the same rate. There is significant pressure to do so with the mountain of mainstream adopters waiting, so we'll keep an eye on this one.

Finally, our most exciting item to watch in 2015 will be Hadoop's subtle transformation as business drivers move it beyond a primarily write-once/read-many reputation to that of full create/read/update/delete (CRUD) operational capability at big data scale. The benefits of the Hadoop architecture with YARN and HDFS go well beyond big data analytics, and enterprise data architects can start thinking about what a YARN data operating system can do with operational systems. In a few years, this could also redefine the data lake, or we'll simply create another label for the industry to debate. Once big data, high performance, and CRUD requirements are met within Hadoop, enterprise architects will start thinking about the economies of scale and efficiency gained from this next-generation architecture.

John O'Brien is principal and CEO of Radiant Advisors. With more than 25 years of experience delivering value through data warehousing and business intelligence programs, O'Brien's unique perspective comes from the combination of his roles as a practitioner, consultant, and vendor CTO in the BI industry. As a globally recognized business intelligence thought leader, O'Brien has been publishing articles and presenting at conferences in North America and Europe for the past 10 years. His knowledge in designing, building, and growing enterprise BI systems and teams brings real-world insights to each role and phase within a BI program. Today, through Radiant Advisors, O'Brien provides research, strategic advisory services, and mentoring that guide companies in meeting the demands of next-generation information management, architecture, and emerging technologies. In Q1 2014, Radiant Advisors released its Independent Benchmark: SQL on Hadoop Performance, which captured the current state of options and widely varying performance. Radiant Advisors plans to release the next benchmark 1 year later, in Q1 2015, to quantify those efforts.


    sponsored content

Big Data for Tomorrow

The world of enterprise solutions has changed. It has become distributed and real-time. The famous New York Times writer Thomas Friedman summarizes it succinctly: "The World Is Flat." In addition to this technological advancement, the compute and online world is demanding real-time answers to questions. These ever-growing and disparate data sources need to be efficiently connected to enable new discovery and more insightful answers.

To maintain competitive advantage in this new landscape, organizations must be prepared to weed out the hype and focus on proven ways to future-proof existing systems while efficiently integrating with new technologies to provide the required value of real-time insight to users and decision-makers. Companies need to focus on the following key requirements for new technologies to take advantage of data and find unique business value and new revenues.

DISTRIBUTED

The world is moving towards distributed architectures. Memory is becoming a commodity, the Internet is easily accessible and fairly inexpensive, and, with more sources of data creating an increase in information, it is easy to understand how organizations will require multiple, distributed data centers to store it all.

With distributed architectures comes a need for distributed features such as parallel ingest, or the ability to quickly obtain data using multiple resources/locations to enable real-time application access to information that is being processed. Then there is a need for distributed task processing, which helps to move the processes closer to the locations where data is stored, thus saving time and improving query performance as a side effect. Finally, there becomes a need for distributed query as well. This is the ability to perform a search of data across different locations, quickly, in order to find hidden value within the data for improved business decision support.

SCALABLE

The next requirement revolves around ease of scalability. When working with distributed architecture, it is inevitable that companies will need to eventually scale out their applications across multiple locations in order to keep up with growing data demands. Technology that is easily scalable and adaptable is very important to long-term success and helps with managing ROI.

FLEXIBLE

Another requirement, due to the many different types of data being collected, is the ability to handle multiple data types. If a technology is too limited in the way it collects information from structured, unstructured, and semi-structured sources, organizations will find it difficult to grow their solution long-term due to concerns with data type limitations. On the other hand, a technology that is able to natively or alternatively store and access many types of information from multiple data sources will be key to enabling long-term competitive advantage and growth.

COMPLEMENTARY

And finally, there is a need to address existing and legacy solutions already implemented at a large scale. Most enterprises will not be tearing out widely implemented solutions spanning their organization. It is important to require that any new technologies being assessed have the ability to complement existing legacy solutions as well as any potential new technologies that may add benefit to the business, its customers, and its solutions and services.

Today's enterprise success depends on the ability to obtain key information quickly and accurately and then apply that knowledge to your business to make more reliable decisions. Utilizing technology that is able to offer the peace of mind to be successful through distributed, scalable, flexible, and complementary features is priceless.

For over a quarter century, Objectivity, Inc.'s embedded database software has helped discover and unlock the hidden value in Big Data for improved real-time intelligence and decision support. Objectivity focuses on storing, managing, and searching the connection details between data. Its leading-edge technologies, InfiniteGraph, a unique distributed, scalable graph database, and Objectivity/DB, a distributed and scalable object management database, enable unique search and navigation capabilities across distributed datasets to uncover hidden, valuable relationships within new and existing data for enhanced analytics, and facilitate custom distributed data management solutions for some of the most complex and mission-critical systems in operation around the world today.

By working with a well-established technology provider with long-term, proven Big Data implementations, enterprise companies can feel confident that the future requirements of their organizations will be met, along with the ability to take advantage of new technological advances to keep ahead of the market. For more information on how to get started with evaluating technologies for your business, contact Objectivity, Inc. to inquire about our complimentary 2-hour solution review with a senior technical consultant. Visit our website at www.objectivity.com for more information.

OBJECTIVITY, INC.
www.objectivity.com

industry updates

The Enabling Force Behind Digital Enterprises
The State of Big Data Management
By Joe McKendrick

For years, data management was part of a clear and well-defined mission in organizations. Data was generated from transaction systems, then managed, stored, and secured within relational database management systems, with reports built and delivered to business decision makers' specs.

This rock-solid foundation of skills, technologies, and priorities served enterprises well over the years. But lately, this arrangement has been changing dramatically. Driven by insatiable demand for IT services and data insights, as well as the proliferation of new data sources and formats, many organizations are embracing new technology and methods such as cloud, database as a service (DBaaS), and big data. And, increasingly, mobile isn't part of a vendor's pitch sheet or a futuristic overview at a conference presentation. It's part of today's reality, a part of everyday business. Many organizations are already providing faster delivery of applications and differentiated products and services, and some are building new customer experiences through social, mobile, analytics, and cloud.

Over the coming year (2015), we will likely see the acceleration of the following dramatic shifts in data management:

1. More Automation to Manage the Squeeze

There is a lot of demand coming from the user side, but data management professionals often find themselves in a squeeze. Business demand for database services, as well as associated data volumes, is growing at a rate of 20% a year on average, a survey by Unisphere Research finds. In contrast, most IT organizations are experiencing flat or shrinking budgets. Other factors, such as substantial testing requirements and outdated management techniques, are all contributing to cost escalation and slow IT response.

Database professionals report that they spend more time managing database lifecycles than anything else. A majority still overwhelmingly perform a range of tasks manually, from patching databases to performing upgrades. Compliance remains important and requires attention. As databases move into virtualized and cloud environments, there will be a need for more comprehensive enterprise-wide testing. Another recent Unisphere Research study finds that for more than 50% of organizations, it takes their IT department 30 days or more to respond to new initiatives or deploy new solutions. For a quarter of organizations, it takes 90 days or more. In addition, more than two-thirds of organizations indicate that the number of databases they manage is expanding. The most pressing challenges they are facing as a result of this expansion are licensing costs, additional hardware and network costs, additional administration costs, and complexity. ("The Empowered Database: 2014 Enterprise Platform Decisions Survey," September 2014)

As data professionals find their time and resources squeezed between managing increasingly large and diverse data stores, increased user demands, and restrictive budgets, there will be greater efforts to automate data management tasks. Expect a big push to automation in the year ahead.

advertisement
BUILD A SINGLE VIEW OF YOUR CUSTOMER
Don't spend months rebuilding data tables to combine multiple systems. Make decisions based on your business needs, not on the limitations of NoSQL solutions. Use Postgres with JSON to import your data and relate customers to contracts and contracts to customers and anything else. Visit www.enterprisedb.com to learn more.
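The advertisement above refers to Postgres's JSON support. A minimal, generic sketch of the idea follows, using the jsonb type and psycopg2; the table and field names are hypothetical and the connection string is a placeholder, not a recommended setup.

```python
import psycopg2

# Placeholder connection string; adjust for a real environment.
conn = psycopg2.connect("dbname=crm user=app password=secret host=localhost")
cur = conn.cursor()

# Customers arrive as JSON documents; contracts stay relational.
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (id serial PRIMARY KEY, doc jsonb NOT NULL);
    CREATE TABLE IF NOT EXISTS contracts (
        id serial PRIMARY KEY,
        customer_id integer REFERENCES customers(id),
        value numeric
    );
""")

cur.execute(
    "INSERT INTO customers (doc) VALUES (%s::jsonb) RETURNING id",
    ('{"name": "Acme Corp", "segment": "manufacturing", "contacts": [{"email": "sales@example.com"}]}',),
)
customer_id = cur.fetchone()[0]
cur.execute("INSERT INTO contracts (customer_id, value) VALUES (%s, %s)", (customer_id, 250000))

# Relate the JSON documents to the relational rows: one joined "single view".
cur.execute("""
    SELECT c.doc ->> 'name' AS customer, ct.value
    FROM customers c
    JOIN contracts ct ON ct.customer_id = c.id
    WHERE c.doc ->> 'segment' = 'manufacturing'
""")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```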

2. Big Data Becomes Part of Normal Day-to-Day Business

Relational data coming out of transactional systems is now only part of the enterprise equation, and will share the stage to a greater degree with data that previously could not be cost-effectively captured, managed, analyzed, and stored. This includes data coming in from sensors, applications, social media, and mobile devices.

With increased implementations of tools and platforms to manage this data, including NoSQL databases and Hadoop, organizations will be better equipped to prepare this data for consumption by analytic software. A recent survey of Database Trends and Applications readers finds 26% now running Hadoop within their enterprises, up from 12% 3 years ago. A majority, 63%, also now operate NoSQL databases at their locations ("DBTA Quick Poll: New Database Technologies," April 2014).

3. Cloud Opens Up Database as a Service

More and more, data managers and professionals will be working with cloud-based solutions and data, whether associated with a public cloud service or an in-house database-as-a-service (DBaaS) solution. This presents many new opportunities to provide new capabilities to organizations, as well as new challenges. Moving to cloud means new programming and data modeling approaches will be needed. Integration between on-premises and off-premises data also will be intensifying. Data security will be a front-burner issue.

Recent Unisphere Research surveys find that close to two-fifths of enterprises either already have or are considering running database functions within a private cloud, and about one-third are currently using or considering a public cloud service. For more than 25% of organizations, usage of private-cloud services increased over the past year.

Cloud and virtualization are being seamlessly absorbed into the jobs of most database administrators, in some cases reducing traditional activity while expanding their roles. Database as a service (DBaaS), or running databases and managing data within an enterprise private cloud setting, offers data managers and executives a means to employ shared services to manage their fast-growing environments. The potential advantage of DBaaS is that database managers need not re-create processes or environments from scratch, as these resources can be pre-packaged based on corporate or compliance standards and made readily available within the enterprise cloud. Close to half of enterprises say they would like to see capacity planning services offered through private clouds, while 40% look for shared database resources. A similar number would value cloud-based services providing automated database provisioning.

4. Virtualization and Software-Defined Data Centers on the Way

Until recently, mentioning the term "platform" brought images of Windows, mainframe, and Linux servers to mind. However, for most enterprises, platform has become irrelevant. This extends to the database sphere as well; many of the functions associated with specific databases can be abstracted away from underlying hardware and software.

The use of virtualization is helping to alleviate strains being created by the increasing size and complexity of database environments. The use of virtualization within database environments is increasing. Almost two-thirds of organizations in a recent Unisphere Research survey say there have been increases over the past year. Nearly half report that more than 50% of their IT infrastructure is virtualized. The most common benefits organizations report as a result of using virtualization within their database environments are reduced costs, consolidation, and standardization of their infrastructure ("The Empowered Database: 2014 Enterprise Platform Decisions Survey," September 2014).

Another emerging trend, encompassing software-defined data centers, software-defined storage, and software-defined networking, promises to take this abstraction to a new level. Within a software-defined environment, services associated with data centers and database services (storage, data management, and provisioning) are abstracted into a virtual service layer. This means managing, configuring, and scaling data environments to meet new needs will increasingly be accomplished from a single control panel. It may take some time to reach this stage, as many of the components of software-defined environments are just starting to fall into place. Expect to see significant movement in this direction in 2015.

5. Data Managers and Professionals Will Lead the Drive to Secure Corporate Data

One need only look at recent headlines to understand the importance of data security: major enterprises have suffered data breaches over the past year and, in some cases, have taken CIOs and top executives down with them. The rise of big data and cloud, with their more complex integration requirements, accessibility, and device variety, has increased the need for greater attention to data security and data governance issues.

Data security has evolved into a top business challenge, as villains take advantage of lax preventive and detective measures. In many ways, it has become an enterprise-wide issue in search of leadership. Senior executives are only too painfully aware of what's at stake for their businesses, but often don't know how to approach the challenge. This is an opportunity for database administrators and security professionals to work together, take a leadership role, and move the enterprise to take action.

Over the coming year, database managers and professionals will be called upon to be more proactive and lead their companies to successfully ensure data privacy, protect against insider threats, and address regulatory compliance. An annual survey by Unisphere Research for the Independent Oracle Users Group (IOUG) finds there is more awareness than ever of the critical need to lock down data environments, but also organizational hurdles in building awareness and budgetary support for enterprise data security ("DBA Security Superhero: 2014 IOUG Enterprise Data Security Survey," October 2014).

6. Mobile Becomes an Equal Client

Mobile computing is on the rise, and increasingly mobile devices will be the client of choice with enterprises in the year ahead. This means creating ways to access and work with data over mobile devices. More analytics, for example, is being supported within mobile apps. Some of the leading BI and analytics solutions vendors now offer mobile apps with dashboards, often configurable, that provide insight and visibility into operational trends to decision makers who are outside of their offices. While industry watchers have been predicting the democratization of data analytics across enterprises for years, the arrival of mobile apps as front-end clients to BI and analytics systems may be the ultimate gateway to easy-to-use analytics across the enterprise. By their very nature, mobile apps need to be designed to be as simple and easy to use as possible. Over the coming year, mobile app access to key data-driven applications will become part of every enterprise.

The ability to access data from any and all devices, of course, will increase security concerns. While many enterprises have tacitly approved the bring your own device (BYOD) trend in recent years, some are looking to move to corporate-issued devices that will help lock down sensitive data. The coming year will see increased efforts to better ensure the security of data being sent to mobile devices.

    7. Storage Enters the Limelight

Storage has always been an unappreciated field of endeavor. It has been almost an afterthought, seen in disk drives and disk arrays running somewhere in the back of data centers. This is changing rapidly, as enterprises recognize that storage is shaping their infrastructures' capabilities. There's no question that many organizations are dealing with rapidly expanding data stores. Much of today's data growth, coming out of enterprise applications, is being exacerbated by greater volumes of unstructured, social media, and machine-generated data making their way into the business analytics platform. Many enterprises are also evolving their data assets into "data lakes," in which enterprise data is stored up front in its raw form and accessed when needed, versus being loaded into purpose-built, siloed data environments.

The question becomes, then, where and how to store all this data. The storage approach that has worked well for organizations over the decades (produce data within a transaction system, then send it downstream to a disk, and ultimately, a tape system) is being overwhelmed by today's data demands. Not only is the amount of data rapidly growing, but more users are demanding greater and more immediate access to data, even when it may be several weeks, months, or years old.

Over the coming year, there will be a push by enterprises to manage storage smartly, versus simply adding more disk capacity to existing systems or purchasing new systems from year to year. A recent survey by Unisphere Research finds growing impetus toward smarter storage solutions, which include increased storage efficiency through data compression, information lifecycle management and consolidation, or deployment strategies such as tiered storage. At the same time, storage expenditures keep rising, eating a significant share of IT budgets and impeding other IT initiatives. For those with significant storage issues, the share storage takes out of IT budgets is even greater ("Managing Exploding Data Growth in the Enterprise: 2014 IOUG Database Storage Survey," May 2014).

What's Ahead

The year 2015 represents new opportunities to expand and enlighten data management practices and platforms to meet the needs of the ever-expanding digital enterprise. To be successful, digital business efforts need to have solid data management practices underneath. As enterprises go digital, they will be relying on well-managed and diverse data to explore and reach new markets.

Joe McKendrick is an author and independent researcher covering innovation, information technology trends, and markets. Much of his research work is in conjunction with Unisphere Research, a division of Information Today, Inc. (ITI), for user groups including SHARE, the Oracle Applications Users Group, the Independent Oracle Users Group, and the International DB2 Users Group. He is also a regular contributor to Database Trends and Applications, published by ITI.


    The State of Data Integration

    By Stephen Swoyer

Data Integration Evolves to Support a Bigger Analytic Vision

What has always made data a hard problem is precisely the issue of accessing, preparing, and producing it for machine and, ultimately, for human consumption. What makes this a much harder problem in the age of big data is that the information we're consuming is vectored to us from so many different directions.

The data integration (DI) status quo is predicated on a model of data-at-rest. The designated final destination for data-at-rest is (and, at least for the foreseeable future, will remain) the data warehouse (DW). Traditionally, data of a certain type was vectored to the DW from more or less predictable directions (viz., OLTP systems, or flat files) and at the more or less predictable velocities circumscribed by the limitations of the batch model. Thanks to big data, this is no longer the case. Granted, the term "big data" is empty, hyperbolic, and insufficient; granted, there's at least as much big data hype as big data substance. But still, as a phenomenon, big data at once describes 1) the technological capacity to ingest, store, manage, synthesize, and make use of information to an unprecedented degree and 2) the cultural capacity to imaginatively conceive of and meaningfully interact with information in fundamentally different ways. One consequence of this has been the emergence of a new DI model that doesn't so much aim to supplant as to enrich the status quo ante. In addition to data-at-rest, the new DI model is able to accommodate data-in-motion, i.e., data as it streams and data as it pulses: from the logs or events generated by sensors or other periodic signalers to the signatures or anomalies that are concomitant with aperiodic events such as fraud, impending failure, or service disruption.
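To make the contrast concrete, here is a minimal sketch of a data-in-motion consumer. It is illustrative only: the event source, field names, and threshold are hypothetical, and a production pipeline would typically sit on a streaming framework (Kafka plus a stream processor, for instance) rather than a hand-rolled loop.

```python
import json
from collections import deque

WINDOW = 100          # number of recent readings to keep per sensor (hypothetical)
THRESHOLD = 3.0       # flag readings more than 3 standard deviations from the mean

history = {}          # sensor_id -> deque of recent values

def handle_event(raw_line):
    """Process one event from a stream of JSON sensor readings (data-in-motion)."""
    event = json.loads(raw_line)                  # e.g. {"sensor": "s1", "value": 42.7}
    values = history.setdefault(event["sensor"], deque(maxlen=WINDOW))
    if len(values) >= 10:                         # need some history before scoring
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = var ** 0.5 or 1.0
        if abs(event["value"] - mean) / std > THRESHOLD:
            print(f"anomaly on {event['sensor']}: {event['value']:.2f}")
    values.append(event["value"])

# In practice the lines would arrive from a message broker or socket;
# here a small in-memory sample stands in for the stream.
for line in ['{"sensor": "s1", "value": 1.0}'] * 20 + ['{"sensor": "s1", "value": 9.0}']:
    handle_event(line)
```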

Needless to say, comparatively little of this information is vectoring in from conventional OLTP systems. And that, as poet Robert Frost might put it, makes all the difference.

    Beyond Description

We're used to thinking of data in terms of the predicates we attach to it. Now as ever, we want and need to access, integrate, and deliver data from traditional structured sources such as OLTP DBMSs, or flat and/or CSV files. Increasingly, however, we're alert to, or we're intrigued by, the value of the information that we believe to be locked into multi-structured or so-called "unstructured" data, too. (Examples of the former include log files and event messages; the latter is usually used as a kitchen-sink category to encompass virtually any data type.) Even if we put aside the philosophical problem of structure as such (semantics is structure; schema is structure; a file type is structure), we're confronted with the fact that data integration practices and methods must and will differ for each of these different types. The kinds of operations and transformations we use to prepare and restructure the normalized data we extract from OLTP systems for business intelligence (BI) reporting and analysis will prove to be insufficient (or quite simply inapposite) when brought to bear against these different types of data. The problem of accessing, preparing, and delivering unconventional types of data from unconventional types of sources, as well as of making this data available to a new class of unconventional consumers, requires new methods and practices, to say nothing of new (or at least complementary) tools.

This has everything to do with what might be called a much bigger analytic vision. Inspired by the promise of exploiting data mining, predictive analytics, machine learning, or other types of advanced analytics on a massive scale, the focus of DI is shifting from that of a static, deterministic discipline, in which a kind of two-dimensional world is represented in a finite number of well-defined


    sponsored content

Traditionally, organizations have relied upon a single data warehouse to serve as the center of their data universe. This data warehouse approach operated on a paradigm in which the data revealed a single, unified version of the truth. But today, both the amount and types of data available have increased dramatically. With the advent of Big Data, companies now have access to more business-relevant information than ever before, resulting in many data repositories to store and analyze it.

THE CHALLENGES OF MOVING BIG DATA

    However, to use Big Data, you must

    be able to move it, and the challenges of

    moving Big Data are multi-faceted. Out of

    the gate, the pipes between data repositories

    remain the same size, while the data grows

    at an exponential rate. The issue worsens

    when traditional tools are used to attempt to

    access, process and integrate this data with

    other systems. Yet, companies cannot rely on

    traditional data warehouses alone.

Thus, companies are increasingly turning to Apache Hadoop, the free, open source, scalable software for distributed computing that handles both structured and unstructured data. The movement towards Hadoop is indicative of something bigger: a new paradigm that's taking over the business world, that of the modern data architecture and the data supply chain that feeds it. The data supply chain describes a new reality in which businesses find themselves coordinating multiple data sources rather than using a single data warehouse. The data from these sources, which often varies in content, structure, and type, has to be integrated with data from other departments and other target systems within an enterprise. Big Data is rarely used en masse. Instead, different types of data tell different stories, and companies need to be able to integrate all of these narratives to inform business decisions.

HADOOP'S ROLE IN THE DATA SUPPLY CHAIN

In this new world, companies must constantly move data from one place to another to ensure efficiency and lower costs. Hadoop plays a significant role in the data supply chain. However, it's not an end-all solution. The standard Hadoop toolsets lack several critical capabilities, including the ability to move data between Hadoop and relational databases. The technologies that exist for data movement across Hadoop are cumbersome. Companies need

    Hadoop are cumbersome. Companies need

    solutions that make data movement to and

    from Hadoop easier, faster, and more cost

    effective.

    While open source tools like Sqoop are

    designed to deal with large amounts of data,

    they are often not enough by themselves.

    These tools can be difficult to use, require

    specialized skills and time to implement,

    typically focus only on certain types of data,

    and cannot support incremental changes or

    real-time feeds.

EFFECTIVELY MOVING BIG DATA INTO AND OUT OF HADOOP

The most effective answer to this challenge is to implement solutions that are specifically designed to ease and accelerate the process of data movement across a broad number of platforms. These technologies allow IT organizations to easily move data from one repository to another in a highly visible manner. The software should also unify and integrate data from all platforms within an enterprise, not just Hadoop. And it should include change data capture (CDC) technology to keep the target data up to date, in a way that's sensitive to network bandwidth.

Attunity offers a solution for companies looking to turbocharge the flows across their data supply chain while fully supporting a modern data architecture. Attunity Replicate features a user-friendly GUI, with a Click-to-Replicate design and drag-and-drop functionality to move data between repositories. Attunity supports Hadoop as a source and as a target, as well as every major commercial database platform and data warehouse available. It is scalable and manageable and can be used to move data to and from the cloud when combined with Attunity CloudBeam.

MAKING BIG DATA & HADOOP WORK FOR YOU!

Attunity enables companies to improve

    their data flows to capitalize on all their data,

    including Big Data sources. Their solutions

    limit the investment a company needs to

    make by reducing the hardware and software

    needed for managing and moving data

across multiple platforms out of the box.

    Additionally, Attunity solutions are high

    performance and provide an easy-to-use

    graphical interface that helps companies

    make timely and fully-informed decisions.

    Using high-performance data movement

    software like Attunity, companies can not

    only unleash the full power of Hadoop but

    also the power of all their other technologies

    to enable real-time analytics and true

    competitive advantage.

    To learn more,

    download

    this Attunity

    whitepaper:

    Hadoop and

the Modern Data Supply Chain

    http://bit.ly/HadoopWP

    ATTUNITY

    www.attunity.com

    Unleashing the Value of Big Data & Hadoop


    The State of Data Integration

dimensions, to a polygonal or probabilistic discipline with a much greater number of dimensions. The static stuff will still matter and will continue to power the great bulk of day-to-day decision making, but this will in turn be enriched, episodically, with different types of data. The challenge for DI is to accommodate and promote this enrichment, even as budgets hold steady (or are adjusted only marginally) and resources remain constrained.

    Automatic for the People

What does this mean for data integration? For one thing, the day-to-day work of traditional DI will, over time, be simplified, if not actually automated. This work includes activities such as 1) the exploration, identification, and mapping of sources; 2) the creation and maintenance of metadata and documentation; 3) the automation or acceleration, insofar as feasible, of testing and quality assurance; and, crucially, 4) the deployment of new OLTP systems and data warehouses, as well as of BI and analytic applications or artifacts. These activities can and will be accelerated; in some cases (as with the generation and maintenance of metadata or documentation) they will, for practical, day-to-day purposes, be more or less completely automated.

This is in part a function of the maturity of the available tooling. Most DI and RDBMS vendors ship platform-specific automation features (pre-fab source connectivity and transformation wizards; data model design, generation, and conversion tools; SQL, script, and even procedural code generators; scheduling facilities; in some cases even automated dev-testing routines) with their respective tools. Similarly, a passel of smaller, self-styled data warehouse automation vendors market platform-independent tools that purport to automate most of the same kinds of activities, and which are also optimized for multiple target platforms. On top of this, data virtualization (DV) and on-premises-to-cloud integration specialists can bring intriguing technologies to bear, too. Most DI vendors offer DV (or data federation) capabilities of some kind; others market DV-only products. None of these tools is in any sense a silver bullet: custom-fitting and design of some kind is still required and, frankly, always will be required. The catch, of course, is that even though such tools can likewise help to accelerate key aspects of the day-to-day work of building, managing, optimizing, maintaining, or upgrading OLTP and BI/decision support systems, they can't and won't replace human creativity and ingenuity. The important thing is that they give us the capacity to substantively accelerate much of the heavy lifting of the work of data integration.

Big Data Integration: Still a Relatively New Frontier

This just isn't the case in the big data world. As Douglas Adams, author of The Hitchhiker's Guide to the Galaxy, might put it, traditional data integration tools or services are mature and robust in exactly the way that big data DI tools aren't.

At this point, guided and/or self-service features (to say nothing of management-automation amenities) are still mostly missing from the big data offerings. As a result, organizations will need more developers and more technologists to do more hands-on stuff when they're doing data integration in conjunction with big data platforms.

Industry luminary Richard Winter tackled this issue in a report entitled "The Real Cost of Big Data," which highlights the cost disparity between using Hadoop as a landing area and/or persistent store for data versus using it as a platform for business intelligence (BI) and decision support workloads. As a platform for data ingestion, persistence, and preparation, the research suggests, Hadoop is orders of magnitude cheaper than a conventional OLTP or DW system. Conversely, the cost of using Hadoop as a primary platform for BI and analytic workloads is orders of magnitude higher.

An issue that tends to get glossed over is that of Hadoop's efficacy as a data management platform. Managing data isn't simply a question of ingesting and storing it; it's likewise, and to a much greater extent, a question of retrieving just the right data, of preparing it in just the right format, and of delivering it at more or less the right time. In other words, big data tools aren't only less productive than those of traditional BI and decision support, but big data management platforms are themselves comparatively immature, too. Generally speaking, they lack support for key database features or for core transaction-processing concepts, such as ACID integrity. The simple reason for this is that many platforms either aren't databases or eschew conventional DBMS reliability and concurrency features to address scaling-specific or application-specific requirements. The upshot, then, is that the human focus of data integration is shifting and will continue to shift to Hadoop and other big data platforms, not least because these platforms tend to require considerable human oversight and intervention.

This doesn't mean that data, applications, and other resources are shifting or will shift to big data platforms, never to return or to be recirculated. For one thing, there's cloud, which is having no less a profound impact on data integration and data management. Data must be vectored from big data platforms (in the cloud or on-premises) to other big data



platforms (in the cloud or on-premises), to the cloud in general, i.e., to SaaS, platform-as-a-service (PaaS), and infrastructure-as-a-service (IaaS) resources, and, last but not least, to good old on-premises resources like applications and databases.

There's no shortage of data exchange formats for integrating data in this context (JSON and XML foremost among them), but the venerable SQL language will continue to be an important and even a preferred mechanism for data integration in on-premises, big data, and even cloud environments. The reasons for this are many. First, SQL is an extremely efficient and productive language: According to a tally compiled by Andrew Binstock, editor-in-chief of Dr. Dobb's Journal, SQL trails only legacy languages such as .ASP and Visual Basic (at number 1 and 2, respectively) and Java (at number 3) productivity-wise. (Binstock based his tally on data sourced from the International Software Benchmarking Standards Group, or ISBSG, which maintains a database of more than 6,000 software projects.) Second, there's a surfeit of available SQL query interfaces and/or adapters, along with (to a lesser extent) SQL-savvy coders. Third, open source software (OSS) and proprietary vendors have expended a simply shocking amount of effort to develop ANSI-SQL-on-Hadoop technologies. This is a very good thing, chiefly because SQL is arguably the single most promising tool for getting the right data in the right format out of Hadoop.

Two years ago, for example, the most efficient ways to get data out of Hadoop included:

1. Writing MapReduce jobs in Java in order to translate the simple dependency, linear chain, or directed acyclic graph (DAG) operations involved in data engineering into map and reduce operations;
2. Writing jobs in Pig Latin for Hadoop's Pig framework to achieve basically the same thing;
3. Writing SQL-like queries in Hive Query Language (HiveQL) to achieve basically the same thing; or
4. Exploiting bleeding-edge technologies (such as Cascading, an API layered on top of Hadoop that's supposed to make it easier to program/manage) to achieve basically the same thing.

Today, there's no shortage of mechanisms to get data from Hadoop. Take Hive, an interpreter that compiles HiveQL queries into MapReduce jobs. As of Hadoop 2.x, Hive can leverage either Hadoop's MapReduce engine or the new Apache Tez framework. Tez is just one of several designs that exploit Hadoop's new resource manager, YARN, which makes it easier to manage and allocate resources for multiple compute engines, in addition to MapReduce. Thus, Apache Tez, which is optimized for the operations, such as DAGs, that are characteristic of data transformation workloads, now offers features such as pipelining and interactivity for ETL-on-Hadoop. There's also Apache Spark, a cluster computing framework that can run in the context of Hadoop. It's touted as a high-performance complement and/or alternative to Hadoop's built-in MapReduce compute engine; as of version 1.0.0, Spark is paired with Spark SQL, a new, comparatively immature, SQL interpreter. (Spark SQL replaces a predecessor project, dubbed Shark, which was conceived as a Hive-oriented SQL interpreter.) Over the last year, especially, Spark has become one of the most hyped of Hadoop-oriented technologies; many DI or analytic vendors now support Spark to one degree or another in their products. Generally speaking, most vendors now offer SQL-on-Hadoop options of one kind or another, while others also offer native (optimized) ETL-on-Hadoop offerings.
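To give a flavor of the SQL-on-Hadoop approach described above, here is a minimal PySpark sketch. It is illustrative only: the web_logs Hive table, its columns, and the output path are hypothetical, and the same query could just as easily be issued through Hive on Tez or another SQL-on-Hadoop engine.

```python
from pyspark.sql import SparkSession

# Spark session with access to the Hive metastore (assumes a configured cluster).
spark = (SparkSession.builder
         .appName("sql-on-hadoop-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Use SQL to pull just the right data, in the right shape, out of Hadoop:
# a daily error count per application, taken from a hypothetical web_logs table.
daily_errors = spark.sql("""
    SELECT app_name,
           to_date(event_time) AS event_day,
           COUNT(*)            AS error_count
    FROM   web_logs
    WHERE  status_code >= 500
    GROUP  BY app_name, to_date(event_time)
""")

# Hand the result off to downstream consumers, e.g., as Parquet on HDFS.
daily_errors.write.mode("overwrite").parquet("/warehouse/curated/daily_errors")
```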

What's Ahead

Cloud is a critical context for data integration. One reason for this is that most providers offer export facilities or publish APIs that facilitate access to cloud data. Another reason, as I wrote last year, is that doing DI in the cloud doesn't invalidate (completely or, even, in large part) existing best practices: if you want to run advanced analytics on SaaS data, you've either got to load it into an existing, on-premises repository or, alternatively, expose it to a cloud analytics provider. What you do in the former scenario winds up looking a lot like what you do with traditional DI. And the good news is that you can do a lot more with traditional DI tools or platforms than used to be the case. Most data integration offerings can parse, shred, and transform the JSON and XML used for data exchange; some can do the same with formats such as RDF, YAML, or Atom. Several prominent database providers offer support for in-database JSONs (e.g., parsing and shredding JSONs via a name-value-pair function or landing and storing them intact as variable character text), while others offer some kind of support for in-database storage (and querying) of JSON data. DV vendors are typically no less accomplished than the big DI platforms with respect to their capacity to accommodate a wide variety of data exchange formats, from JSON/XML to flat files.
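As a rough illustration of what "shredding" a JSON document into name-value pairs involves, consider the small sketch below; the sample record and field names are hypothetical, and real DI tools or in-database functions perform the same flattening at far greater scale.

```python
import json

def shred(obj, prefix=""):
    """Flatten a nested JSON object into dotted name-value pairs."""
    pairs = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            pairs.update(shred(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            pairs.update(shred(value, f"{prefix}{i}."))
    else:
        pairs[prefix.rstrip(".")] = obj
    return pairs

# Hypothetical exchange document, e.g., exported from a SaaS application.
doc = json.loads('{"order": {"id": 42, "items": [{"sku": "A1", "qty": 2}]}}')
for name, value in shred(doc).items():
    print(name, "=", value)
# order.id = 42
# order.items.0.sku = A1
# order.items.0.qty = 2
```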

Any account of data integration and big data is bound to be insufficient simply because there is so much happening. As noted, the Hadoop platform is by no means the only game in town, nor, for that matter, the most exciting one. Apache Spark, which (a) runs in the context of Hadoop and which (b) can both persist data (to HDFS, the Hadoop Distributed File System) and run in-memory (using Tachyon), last year emerged as a bona fide big data superstar. Spark is touted as a compelling platform for both analytics and data integration. Several DI vendors already claim to support it to some extent. Spark, like almost everything else in the space, will bear watching. And so it goes.

Stephen Swoyer is a technology writer with more than 16 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost a decade. He's particularly intrigued by the thorny people and process problems most BI and DW vendors almost never want to acknowledge, let alone talk about. You can contact him at [email protected].



it is important to meticulously list all data within the enterprise that could potentially be beneficial to the analytical exercise. "The more data, the better" is the rule here. Analytical models have sophisticated built-in facilities to automatically decide which data elements are important for the task at hand and which ones can be left out of further analysis. The best way to improve the performance of any analytical model is by investing in data. This can be done by working on both quantity and quality simultaneously. Regarding the former, a key challenge concerns the aggregation of structured (e.g., stored in relational databases) and unstructured (e.g., textual) data to provide a comprehensive and holistic view of customer behavior. Closely related to this is the integration of offline and online data, an issue that many companies are struggling with nowadays. Furthermore, companies can also look beyond their internal boundaries and consider the purchase of external data from data poolers to complement their internal analytical models. Extensive research has indicated that this is very beneficial in order to both perfect and benchmark the analytical models developed.

Although data is typically available in large quantities, its quality is often a more painful attention point. Here the GIGO principle applies: garbage in, garbage out, or bad data yields bad models. This may sound obvious at first. However, good data quality is often the Achilles' heel in many analytical projects. Data quality can be evaluated along various dimensions such as data accuracy, data completeness, data timeliness, and data consistency, to name a few. To be successful in big data and analytics, it is necessary for companies to continuously monitor and remedy data quality problems by setting up master data management programs and creating new job roles such as that of data auditor, data steward, or data quality manager.
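By way of illustration, the sketch below computes two of the data quality dimensions mentioned here, completeness and timeliness, over a handful of customer records; the field names, the 30-day freshness rule, and the records themselves are hypothetical.

```python
from datetime import date

# Hypothetical customer records as they might land from an operational source.
records = [
    {"customer_id": 1, "email": "a@example.com", "last_updated": date(2014, 11, 20)},
    {"customer_id": 2, "email": None,            "last_updated": date(2014, 6, 2)},
    {"customer_id": 3, "email": "c@example.com", "last_updated": None},
]

def completeness(rows, field):
    """Share of rows in which the given field is populated."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

def timeliness(rows, field, as_of, max_age_days=30):
    """Share of rows whose timestamp is present and at most max_age_days old."""
    fresh = sum(
        1 for r in rows
        if r.get(field) is not None and (as_of - r[field]).days <= max_age_days
    )
    return fresh / len(rows)

as_of = date(2014, 12, 1)
print("email completeness:      %.0f%%" % (100 * completeness(records, "email")))
print("last_updated timeliness: %.0f%%" % (100 * timeliness(records, "last_updated", as_of)))
```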

Analytics should always start from a business problem rather than from a specific technological solution. However, this comes with a chicken-and-egg problem: To identify new business opportunities, one needs to be aware of the technological potential first. As an example, think about the area of social media analytics. By first understanding how this technology works, a firm can start thinking about how to leverage it to study its online brand perception or perform trend monitoring. To bridge the gap between technology and the business, continuous education is important. It allows companies to stay ahead of the competition and spearhead the development of new analytical applications. At this point, the academic world should make a mea culpa, since the offering of Master of Science programs in the area of big data and analytics is currently falling short of demand.

Another important component for turning data into concrete business insights and adding value using analytics concerns the proper validation of the analytical models built. Quotes such as "if you torture the data long enough, it will confess" and terms such as "data massage" have cast a negative perspective on the field of analytics. It speaks for itself that analytical models should be properly audited and validated, and many mechanisms, procedures, and tools are available to do this. That's why more and more firms are splitting up their analytical teams into a model development and a model validation team. Good corporate governance then dictates the construction of a Chinese wall between both teams, such that models developed by the former team can be objectively and independently evaluated by the latter team. One might even contemplate having the validation performed by an external partner. By setting up an analytical infrastructure whereby models are critically evaluated and validated on an ongoing basis, a firm is capable of continuously improving its analytical models and, thus, can better target its customers.
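As a simple illustration of independent model validation, the sketch below has a "validation" step re-evaluate a previously built classifier on a holdout sample it never saw during development. The synthetic data and the AUC acceptance threshold are purely illustrative, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

np.random.seed(42)

# Synthetic "customer" data: two behavioral features and a binary churn flag.
X = np.random.randn(5000, 2)
y = (X[:, 0] + 0.5 * X[:, 1] + 0.8 * np.random.randn(5000) > 0).astype(int)

# Model development team: builds the model on its own sample (first 70%).
split = int(0.7 * len(X))
model = LogisticRegression().fit(X[:split], y[:split])

# Model validation team: independently scores the holdout the developers never used.
auc = roc_auc_score(y[split:], model.predict_proba(X[split:])[:, 1])
print("holdout AUC: %.3f" % auc)
assert auc > 0.75, "model fails the (illustrative) acceptance threshold"
```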

Analytics is not a one-shot, single-time exercise. In fact, the frustrating thing is that once an analytical model has been built and put into production, it is outdated. Analytical models constantly lag behind reality, but the gap should be as minimal as possible. Just think about it: An analytical model is built using a sample of data, which is gathered at a specific snapshot in time given a specific internal and external environment. However, these environments are not static, but continuously change because of both internal (new strategies, changing customer behavior) as well as external effects (new economic conditions, new regulations). Think about a fraud detection model whereby

    crimina


Recommended