Post on 05-Jul-2015
Big Data Architectural Series: Creating a Next-Generation Big Data Architecture
facebook.com/perficient twitter.com/Perficient linkedin.com/company/perficient
Perficient is a leading information technology consulting firm serving clients throughout
North America.
We help clients implement business-driven technology solutions that integrate business
processes, improve worker productivity, increase customer loyalty and create a more agile
enterprise to better respond to new business opportunities.
About Perficient
• Founded in 1997
• Public, NASDAQ: PRFT
• 2013 revenue $373 million
• Major market locations:
• Allentown, Atlanta, Boston, Charlotte, Chicago, Cincinnati,
Columbus, Dallas, Denver, Detroit, Fairfax, Houston,
Indianapolis, Lafayette, Minneapolis, New York City,
Northern California, Oxford (UK), Philadelphia, Southern
California, St. Louis, Toronto, Washington, D.C.
• Global delivery centers in China and India
• >2,200 colleagues
• Dedicated solution practices
• ~90% repeat business rate
• Alliance partnerships with major technology vendors
• Multiple vendor/industry technology and growth awards
Perficient Profile
BUSINESS SOLUTIONS
Business Intelligence
Business Process Management
Customer Experience and CRM
Enterprise Performance Management
Enterprise Resource Planning
Experience Design (XD)
Management Consulting
TECHNOLOGY SOLUTIONS
Business Integration/SOA
Cloud Services
Commerce
Content Management
Custom Application Development
Education
Information Management
Mobile Platforms
Platform Integration
Portal & Social
Our Solutions Expertise
Our Speaker
Bill Busch
Sr. Solutions Architect, Enterprise Information Solutions, Perficient
• Leads Perficient's enterprise data practice
• Specializes in business-enabling BI solutions that enable the agile enterprise
• Responsible for executive data strategy, roadmap development, and the delivery of high-impact solutions that enable organizations to leverage enterprise data
• Bill has over 15 years of experience in executive leadership, business intelligence, data warehousing, data governance, master data management, information/data architecture and analytics
Perficient’s Big Data Architectural Series
Business Case
Next Generation Architecture
Future Topics
• Data Integration
• Stream Processing
• NoSQL
• SQL on Hadoop
• Data Quality
• Governance
• Use Cases & Case Studies
Today’s Webinar
Today’s Objectives
5 Architectural Roles For Hadoop
Hadoop Ecosystem Potential vs. Reality
Realizing A Hadoop Centric Architecture
“Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.”
Convergence of structured, unstructured, and dark data
Big Data is the evolution of data creating similar data management issues that IT has struggled to address
for the last 20+ years.
Three Views of Big Data
Common Big Data Business Use Cases
Improve Strategic Decision Making
Customer Experience Analysis
Operational Optimization
Risk and Fraud Reduction
Data Monetization
Security Event Detection and Analysis
IT Cost Management
Expanding Data Ecosystem
• Customer Intelligence
• Operations
• Risk & Fraud
• Data Monetization
• Strategic Development
• Security Intelligence
• IT Optimization
Structured Data (5-20% of Total)
Point-of-Sale
Text Messages
Contracts & Regulatory
Preferences & Emotions
Security Access
Weather
Machine Data
Automobile
Mobile
Communications
Geospatial
Social
Data Ecosystem
Next Generation Enterprise Data Architecture
The Promise: Data Architecture Simplification
Data Integration, Data Hub, Analytics
Stream Processing, Data Warehouse, Operational Data
Hadoop Cluster
The Reality: Maturity Limits the Use Cases
• Realize the potential of Hadoop
• Multi-tenancy is in its infancy
• Hadoop 2.0 and YARN
• Most third-party applications are just moving to YARN
• Hive (and other SQL on Hadoop solutions) maturing
• Robust enterprise functionality is evolving
• Security
• High Availability
Different Types of “Open Source Hadoop”
Apache Projects Only
Proprietary Value Add & Re-Development
Apache Projects + Proprietary Add-ons
Packaged and Online Solutions
• IBM Big Insights
• Oracle Big Data Appliance
• HDInsight
• Many others!
Choosing A Hadoop Distribution
Company Philosophy
Current Relationships
Acceptable Risk
Specialized Functionality
Quick Primer on YARN
What is YARN?
• Yet Another Resource Negotiator
• Sometimes referred to as MapReduce 2.0
• Data operating system
• Fault tolerance
Why is this important?
• Enables multi-tenancy on Hadoop
• Moves processing to the data
*Image provided by Hortonworks
Hadoop
Analytics
Data Warehouse
Stream Processing
Data Factory
Transactional Data Store
Five Common Architectural Roles: Hadoop Big Data Use Cases
Analytical Processing
Analytical Process: 1. Source, 2. Wrangle Data, 3. Model & Tune, 4. Operationalize
Architectural Capabilities:
• Data Ingestion
• Metadata Management
• Data Access
• Data Preparation Tools
• Data Discovery & Visualization
• Data Wrangling Tools
• Business Glossary & Search
• Data Access
• Data Discovery & Visualization
• Analytical Tools
• Analytical Sandbox
• Business Created Reporting
• Model Execution & Management
• Knowledge Management (Portal)
Data Access
• There are many methods for accessing Big Data
• Direct HDFS
• NoSQL / Connector
• Hive / SQL on Hadoop
• Align tool to access methods and file types
• Data Preparation
• Analytics
[Figure: within the Hadoop cluster, source files/data are read by a data preparation tool, which writes tidy data; an analytics tool reads the tidy data and writes the analytical result. Key: read access, write access.]
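As a minimal sketch of the flow on this slide (source files, then a data preparation tool, then tidy data, then an analytics tool, then an analytical result), the Python below uses only the standard library. The sample records, the field names, and the `prepare` and `analyze` helpers are hypothetical and only illustrate the read/write hand-offs; on a cluster the reads and writes would go through HDFS, a NoSQL connector, or Hive rather than in-memory strings.

```python
import csv
import io
from statistics import mean

# Hypothetical raw source file: inconsistent casing, stray whitespace, a blank amount.
RAW = "store, amount\n Boston ,10.5\nboston,\nDallas, 4\n"

def prepare(raw_text):
    """Data preparation step: read raw rows, write tidy rows (one clean record per row)."""
    reader = csv.reader(io.StringIO(raw_text))
    next(reader)  # skip the header row
    tidy = []
    for store, amount in reader:
        amount = amount.strip()
        if not amount:
            continue  # drop records with no measurement
        tidy.append({"store": store.strip().title(), "amount": float(amount)})
    return tidy

def analyze(tidy_rows):
    """Analytics step: read tidy rows, write an analytical result (mean amount per store)."""
    by_store = {}
    for row in tidy_rows:
        by_store.setdefault(row["store"], []).append(row["amount"])
    return {store: mean(values) for store, values in by_store.items()}

tidy_rows = prepare(RAW)
result = analyze(tidy_rows)
```

Keeping preparation and analysis as separate steps with a tidy intermediate mirrors the slide's point about aligning each tool to the access method and file type it actually reads and writes.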
Data Warehouse Roles
• Two models for splitting processing
• Hot – Cold
• Data Warehouse Layer
• Push high user loads to traditional data warehouses
• Fully investigate DW-Hadoop connector functionality
• Leverage opportunity to use in-memory database solutions
Data Warehouse Layer Approach: Hadoop Cluster, Traditional DW/DM
Hot – Cold Data Warehouse: Cold Data (Hadoop Cluster), Hot Data (Traditional DW/DM)
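One way to picture the hot – cold model is a router that splits a date-range query between the two stores. The sketch below is illustrative only: it assumes cold data lives on the Hadoop cluster and hot data in the traditional DW/DM, and the 90-day cutoff, the store labels, and the `route_query` helper are all hypothetical, since the slide does not specify them.

```python
from datetime import date, timedelta

HOT_WINDOW_DAYS = 90  # assumed cutoff; the slide does not specify one

def route_query(start, end, today):
    """Return the stores a date-range query must touch, with the sub-range for each."""
    cutoff = today - timedelta(days=HOT_WINDOW_DAYS)
    plan = []
    if start < cutoff:
        # Rows older than the cutoff are assumed to live on the Hadoop cluster.
        plan.append(("hadoop_cold", start, min(end, cutoff - timedelta(days=1))))
    if end >= cutoff:
        # Rows on or after the cutoff are assumed to live in the traditional DW/DM.
        plan.append(("dw_hot", max(start, cutoff), end))
    return plan

today = date(2015, 7, 5)
plan = route_query(date(2015, 1, 1), date(2015, 6, 30), today)
```

A query that falls entirely inside the hot window never touches the cluster, which is consistent with the guidance to push high user loads to the traditional data warehouse.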
Data Warehouse: Organize Your Data
• Types of data stored on cluster
• Analytical sandboxes
• Team
• Individual
• Quotas
• Potential to replace information lifecycle management solutions
• No right answer – clearly define usage
[Figure: Hadoop cluster zones. Raw Data: Consolidated Data, Streaming Queues, Deltas (Incremental). Processed Data: Common Data (Dimensions, Master Data); Improved / Modeled Data; Published, Analytical and Aggregates. Sandbox Zone. Archived Data.]
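Clearly defined usage can be enforced with something as small as a path convention. The sketch below is an assumption-laden illustration: the zone names follow the slide, but the `/data` root, the directory spellings, and the `zone_path` helper are hypothetical and not part of any Hadoop distribution.

```python
from pathlib import PurePosixPath

# Zone names follow the slide; the /data root and directory spellings are assumptions.
ZONES = {"raw", "processed", "sandbox", "archive"}

def zone_path(zone, *parts):
    """Build an HDFS-style path under a known zone, rejecting undefined zones."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return str(PurePosixPath("/data", zone, *parts))

raw_deltas = zone_path("raw", "deltas", "orders", "2015-07-05")
team_box = zone_path("sandbox", "team", "marketing")
```

Routing every write through one helper like this keeps team and individual sandboxes in predictable locations, which also makes it easier to attach quotas to them.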
Stream and Event Processing
• Dedicated vs. Shared Model
• Persistence of messages, logs, etc.
• Long-term storage
• Queuing
• Pre-load (HDFS) vs. Post-load processing
• Micro-Batch vs. One-at-a-Time
• Programming language support
• Processing guarantee
• At most once
• At least once
• Exactly once
Let business requirements drive need for streaming solutions. It is acceptable to use more
than one solution as long as the roles / purposes of each are clearly defined.
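The difference between the processing guarantees is easiest to see with a redelivered message. The simulation below is a sketch, not tied to Storm, Kafka, or any specific engine; the function names and the lost-acknowledgement model are hypothetical. It shows at-least-once delivery inflating a naive total, and an idempotent consumer that tracks message ids producing the same result an exactly-once system would.

```python
def deliver_at_least_once(messages, fail_first_ack_for):
    """Yield each message; redeliver any whose first acknowledgement is 'lost'."""
    for msg_id, value in messages:
        yield msg_id, value
        if msg_id in fail_first_ack_for:
            yield msg_id, value  # redelivery after the simulated lost ack

def naive_total(stream):
    """Counts every delivery, so duplicates inflate the result."""
    return sum(value for _, value in stream)

def idempotent_total(stream):
    """Tracks processed message ids so a redelivered message is applied only once."""
    seen, total = set(), 0
    for msg_id, value in stream:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        total += value
    return total

messages = [(1, 10), (2, 20), (3, 30)]
naive = naive_total(deliver_at_least_once(messages, {2}))
deduped = idempotent_total(deliver_at_least_once(messages, {2}))
```

Whether the extra bookkeeping is worth it depends on the business requirement: a dashboard counter may tolerate duplicates, while a financial total usually cannot.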
The Data Integration Challenge
Key Point: Hadoop and Hadoop-related technologies can address these challenges. However, they must be architected and governed properly.
The Challenge
• Volume, variety, and velocity create unique challenges for data integration
• 10,000+ unique entities (or file groups) may have to be managed
• Batch windows are still the same or shrinking
Data Factory & Integration
Approaches to Big Data Integration
Hadoop Distributed Tools
• Leverages tools included in the Hadoop distribution and programming languages
• Sqoop, Flume, Spark, Java, and MapReduce are examples
• Tools can be implemented in many different modes
• Hand-coded/scripted
• Runtime configured
• Generated
Data Integration Packages
• Leverage commercial data integration packages to move and transform data
• IBM InfoSphere BigInsights and Informatica are examples
• Key questions: where is processing taking place, and does the tool use the YARN resource manager?
Hybrid (Both Hadoop and Data Integration Package)
• Based on the use case, leverages both Hadoop and COTS tools to move and transform data
Define Pipelines and Stages
[Figure: pipeline diagram. Sources (cloud sources, RDBMS, file hub, object DBMS, log data, stream/message bus) feed the stages Extract; HDFS Load & Formatting; Scraping & Normalization; Cleansing, Aggregation, Transformation; Data Distribution; and Data Access & Distribution. Tools shown on the stages include Sqoop, FTP, packaged tools, ETL tools, Kafka, Storm, MCF, a message bus, and custom code. Targets: RDBMS/DW/IMDB, Hive, HBase, file extracts, NoSQL, stream output.]
Big Data Integration Framework: Typical Services
Key Guidance:
• In lieu of using an ETL product, consider building a Big Data Integration framework
• Apache Falcon provides pipeline management
• Focus is on making all components run-time configurable with metadata
• Can offer significant cost savings over the long run
[Figure: framework components. Load Utility; Metadata Collection; Metadata; Pipeline Config Files; Metadata Config Files; Pipeline Utilities (Parser (Delimiter), Data Standardization, Hive Publishing, MF Coding Converters, File Joiner & Transport, Logging, Checksum, Retention, Replication, Late Arriving Data, Exception Handling); Pipeline Master (ex. Falcon); DB Copy; Archival; Audit; Sqoop; Flume; HDFS Shell.]
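To make "run-time configurable with metadata" concrete, the sketch below drives a parser, a standardization step, and a checksum from a single metadata dictionary. Everything here is a hypothetical illustration: the config keys, the entity, and the helper names are invented for the example, the dictionary stands in for a pipeline config file, and nothing below uses Falcon, Sqoop, or Flume.

```python
import hashlib

# Hypothetical pipeline metadata; in practice this would be loaded from a config file.
PIPELINE_CONFIG = {
    "entity": "orders",
    "delimiter": "|",
    "columns": ["order_id", "country"],
    "standardize": {"country": "upper"},
}

def parse(lines, config):
    """Parser (delimiter) utility: split each line using the configured delimiter."""
    for line in lines:
        yield dict(zip(config["columns"], line.split(config["delimiter"])))

def standardize(records, config):
    """Data standardization utility: apply the per-column rules named in the metadata."""
    rules = {"upper": str.upper, "lower": str.lower}
    for record in records:
        for column, rule in config["standardize"].items():
            record[column] = rules[rule](record[column].strip())
        yield record

def checksum(lines):
    """Checksum utility: SHA-256 over the raw lines, for audit and transport checks."""
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()

def run_pipeline(lines, config):
    """Run the configured stages in order and return the records plus an audit entry."""
    records = list(standardize(parse(lines, config), config))
    audit = {"entity": config["entity"], "rows": len(records), "sha256": checksum(lines)}
    return records, audit

records, audit = run_pipeline(["1|us ", "2|De"], PIPELINE_CONFIG)
```

Onboarding a new entity then means adding a config entry rather than writing new code, which is where the long-run savings mentioned in the guidance would come from when thousands of entities have to be managed.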
SQL on Hadoop
• SQL on Hadoop is changing
• Historically focused on read functionality for analytics
• New breed of SQL on Hadoop
• BI and operational reporting
• Transaction Processing
*Image Provided by Splice Machine
Transactions In Hive
Architectural Scenarios
Architecture Role (columns): Analytics, Data Warehouse, Stream Processing, Data Factory, Transactional Data Store*
Business Use Case (rows):
Strategic Decision Making: P s
Customer Experience: P s P s
Operational Optimization: P s s s
Risk and Fraud Reduction: P s P
Data Monetization: s s P
Security Event Detection and Analysis: P s s s
IT Cost Management: P s P P
P = Primary Use Case, s = Secondary Use Case
* Capability is just emerging within the Hadoop ecosystem. Consider this use case for isolated business cases and early adopters.
Integrating Hadoop into the Enterprise
Determine Business Use Cases
Understand Current Tools & Architecture
Align Business Use Case Priorities
Build Roadmap
Specify Solution Architecture
Update & Maintain Roadmap
Implement Roadmap
Final Thoughts
Do
• Match the business use case to the big data role
• Clearly define a roadmap
• Establish clear architectural standards to drive
• Consistency
• Re-use of resources
• Do your homework when defining a solution architecture
Don’t
• Select an initial use case that relies on immature Hadoop functionality
• Leverage tools that move data off the cluster for processing and then store the data back on the cluster
• Assume all Hadoop technologies integrate well together
As a reminder, please submit your
questions in the chat box.
We will get to as many as possible.
Daily unique content about content management, user experience, portals and other enterprise information technology solutions across a variety of industries.
Perficient.com/SocialMedia
Facebook.com/Perficient
Twitter.com/Perficient
Thank you for your participation today.
Please fill out the survey at the close of this session.