Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
The Great Data Lakes: How to Approach a Big Data Implementation
Twitter Tag: #briefr The Briefing Room
Mission
Reveal the essential characteristics of enterprise software, good and bad
Provide a forum for detailed analysis of today’s innovative technologies
Give vendors a chance to explain their product to savvy analysts
Allow audience members to pose serious questions... and get answers!
Twitter Tag: #briefr The Briefing Room
Topics
April: BIG DATA
May: CLOUD
June: INNOVATORS
Twitter Tag: #briefr The Briefing Room
Will History Repeat Itself Again?
• Partitioning matters
• File formats matter
• Metadata matters
• Access patterns matter
Hadoop may be schema-agnostic, but that doesn’t mean you shouldn’t carefully plan your implementation!
“I’ve always found that plans are useless, but planning is indispensable.”
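To ground the partitioning and file-format points above, here is a minimal PySpark sketch (paths and column names are hypothetical, not from the presentation): raw landing-zone JSON is rewritten as Parquet, partitioned by the columns that dominate downstream access patterns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-layout").getOrCreate()

# Hypothetical landing-zone path; raw JSON exactly as it arrived from the source.
events = spark.read.json("/landing/events/2015-04-07/")

(events.write
    .mode("append")
    # Partitioning matters: queries that filter on these columns prune whole directories.
    .partitionBy("event_date", "region")
    # File formats matter: columnar Parquet compresses well and supports predicate pushdown.
    .parquet("/lake/sor/events/"))
```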
Twitter Tag: #briefr The Briefing Room
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
[email protected] @robinbloor
Twitter Tag: #briefr The Briefing Room
Think Big, A Teradata Company
Last year Teradata acquired Think Big Analytics, Inc., a consulting and solutions company focused on big data
Think Big has expertise in implementing a variety of open source technologies, such as Hadoop, HBase, Cassandra, MongoDB and Storm, as well as experience with Hortonworks, Cloudera and MapR
Its consultants can assist with the planning, management and deployment of big data implementations
Twitter Tag: #briefr The Briefing Room
Guest: Rick Stellwagen
Rick Stellwagen is Data Lake Program Director at Think Big, A Teradata Company. Rick is responsible for defining and rolling out a Data Lake Solution portfolio, identifying and integrating best-in-class internal and external technologies. He is defining the deployment model, offerings, skills, career paths and integrated capabilities required for data lake construction and rollout. He also works with product management, engineering, marketing and external partner alliances to define thought leadership positions and shape product plans both internally and externally.
MAKING BIG DATA COME ALIVE
Data Lake Deployment Best Practices
Rick Stellwagen, Data Lake Program Director April 7, 2015
What is a Data Lake?
A centralized repository of raw data into which all data-producing streams flow and from which downstream facilities may draw
Information Sources → Data Lake → Downstream Facilities
Data Variety is the driving factor in building a Data Lake
Data Lake: Swamp or Reservoir?
Primary Data Lake Use Cases
• Corporate Data Sourcing – Repository – System of Record
  - Govern who, what and when data is accessed or provisioned
  - Track usage, resolve anomalies, visualize, optimize and clarify data lineage
• Historical Data Offload
  - Offload history from operational and analytical data platforms
  - Centralize control of restore capabilities and leverage deep data history
• Data Discovery, Organization and Identification
  - Gain ultimate flexibility in data use and access with schema on read (see the sketch below)
  - Lightly conditioned, un-modeled, flexible modeling
• ETL Offload
  - Foundation for data integration – push staging to Hadoop
  - Data quality and validation
• Business Reporting
  - OLAP analysis sourced and processed directly from the data lake
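As a rough illustration of schema on read in the discovery use case above, this PySpark sketch leaves the landed files untouched and applies a schema only at query time (the path, field names and types are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw files were landed as-is; a schema is imposed only when we read them.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ordered_at", TimestampType()),
])

orders = spark.read.schema(order_schema).json("/lake/ingest/orders/")
orders.createOrReplaceTempView("orders")
spark.sql("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id").show()
```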
Data Lake: Swamp or Reservoir?
• A Data Reservoir is a managed Data Lake that seeks to guarantee quality, access, provenance, and governance.
• The important extra guarantee that makes a Data Reservoir is the presence of metadata that enables non-subject-matter experts to easily find the location of, and entitlements to, the various forms of data stored within.
• Schema metadata is always a given, but…
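A minimal sketch, assuming nothing about Think Big's actual tooling, of the kind of catalog entry that provides that extra guarantee: enough metadata for a non-expert to locate a dataset and see who is entitled to it. All field and dataset names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    location: str                      # where the data lives in the lake
    description: str                   # business meaning, not just schema
    steward: str                       # who owns / governs the data
    entitlements: list = field(default_factory=list)  # groups allowed to use it
    schema_ref: str = ""               # pointer to schema metadata (e.g. an HCatalog table)

orders = CatalogEntry(
    name="orders",
    location="/lake/sor/orders/",
    description="Daily order extracts from the ERP system",
    steward="finance-data-governance",
    entitlements=["finance_analysts", "etl_svc"],
    schema_ref="reservoir.orders",
)
```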
Business-Ontology
• How does this data relate to other data?
• How do we classify this data within the business?
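One lightweight way to hold those business-ontology answers is a plain mapping from dataset to business classification and named relationships; the sketch below is purely illustrative and not a specific product feature.

```python
# Illustrative business-ontology metadata: classification terms and relationships.
ONTOLOGY = {
    "orders": {
        "classification": ["Sales", "Transactions"],
        "related_to": {"customers": "placed_by", "products": "contains"},
    },
    "customers": {
        "classification": ["Sales", "Master Data", "PII"],
        "related_to": {"orders": "places"},
    },
}

def related_datasets(name):
    """Answer 'how does this data relate to other data?' for one dataset."""
    return ONTOLOGY.get(name, {}).get("related_to", {})

print(related_datasets("orders"))   # {'customers': 'placed_by', 'products': 'contains'}
```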
Business-Security
• Who owns the data?
• Who can read the data?
• Who belongs to what group?
• Who can see a column?
Supporting tools: LDAP, Argus (Apache Ranger), Unix bitmask permissions
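The Unix bitmask part of that question can be answered mechanically. A minimal sketch follows, with a hypothetical owner, group and mode, showing how read access falls out of the owner/group/other bits.

```python
import stat

# Hypothetical permission record for a lake directory, e.g. /lake/sor/finance
mode = 0o750                      # owner rwx, group r-x, other ---
owner, group = "etl_svc", "finance_analysts"

def can_read(user, user_groups, mode, owner, group):
    """Return True if `user` may read, following the Unix owner/group/other bitmask rules."""
    if user == owner:
        return bool(mode & stat.S_IRUSR)
    if group in user_groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

print(can_read("alice", {"finance_analysts"}, mode, owner, group))  # True: group read bit is set
print(can_read("bob", {"marketing"}, mode, owner, group))           # False: 'other' has no read bit
```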
Operational
• Where did my data come from?
• What environmental context (landing zone, OS, source system) came with my data?
• What processes touched my data?
• When did my data get ingested? ... get transformed? ... get exported?
• Under what identity?
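Answering these questions later depends on writing them down at ingest time. A minimal sketch of such an operational-metadata record, with illustrative field names, follows:

```python
import getpass
import json
import platform
import socket
from datetime import datetime, timezone

def operational_metadata(source_uri, landing_path, process_name):
    """Capture a minimal operational-metadata record at ingest time."""
    return {
        "source_uri": source_uri,                             # where the data came from
        "landing_path": landing_path,                         # landing-zone location in the lake
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "process": process_name,                              # which process touched the data
        "host": socket.gethostname(),                         # environmental context
        "os": platform.platform(),
        "identity": getpass.getuser(),                        # under what identity it ran
    }

record = operational_metadata("sftp://erp/export/orders.csv",
                              "/lake/ingest/orders/2015-04-07/",
                              "orders_daily_ingest")
print(json.dumps(record, indent=2))
```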
Business-Index
• What contents are in a file?
• What is the data serialization?
• Where can we find certain content in the file?
• What terms are in the contents?
Supporting tools: e-Discovery, Solr, a lot of NoSQL, file magic numbers
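The data-serialization question can often be answered from the file's magic number alone. Below is a minimal sketch using the well-known signatures for Parquet, ORC, Avro object-container files and gzip; anything else is reported as unknown.

```python
# Well-known magic numbers for common lake serializations.
MAGIC = {
    b"PAR1": "parquet",
    b"ORC": "orc",
    b"Obj\x01": "avro",
    b"\x1f\x8b": "gzip",
}

def detect_serialization(path):
    """Guess a file's serialization from its first few bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    for magic, fmt in MAGIC.items():
        if head.startswith(magic):
            return fmt
    return "unknown"
```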
Business-Schema
• How should I interpret my data?
• How does my data denormalize?
• What are my column names?
• Are there any “important” dimensions?
Supporting tools: metadata repository, HCatalog
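A minimal sketch of publishing schema metadata so column names and types are discoverable: register the lake files as a table in the Hive metastore (which HCatalog exposes). Paths, database and table names are hypothetical, and enableHiveSupport assumes a configured metastore.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-schema")
         .enableHiveSupport()          # assumes a Hive metastore is configured
         .getOrCreate())

orders = spark.read.parquet("/lake/sor/orders/")
print(orders.schema.simpleString())    # column names and types, straight from the files

# Register the dataset so downstream tools can find its schema by name
# (assumes a 'reservoir' database already exists in the metastore).
orders.write.mode("overwrite").saveAsTable("reservoir.orders")
```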
Assembling the Reservoir
[Diagram: information sources flow into the data lake (data hub) and on to downstream facilities, behind perimeter authentication and authorization. Stages shown: prepare source metadata, prepare data for ingest, evaluate source data, ingest, collect and manage metadata, profile structure, sequence, compress, protect, automate, and generate reports and discovery signals.]
Enterprise Data Lake Architecture
• Each Region has different “areas”
• Three areas for three types of usage:
  - Data Treatment
  - Data Reservoir
  - Data Lab
[Diagram: each region comprises a Regional Data Treatment Facility (collection pools feeding continuous and bulk ingest into an Ingest Zone, SOR Zone and Export Zone on a master compute cluster, with an operational metadata index, orchestration VM and database, and monitoring; processes include HAR compaction, ingestion/SOR reconciliation, de-duplication and key generation), a Regional Reservoir (lake master data, <LOB> Zone and Export Zone, a business metadata index, and processes to correlate, co-locate, cleanse and de-identify), and a Regional Lab (virtual compute clusters, one per insight, drawing on lake master data). Metadata capture happens at every ingestion point.]
Key: Validate that Ingestion captures Metadata
Data Treatment
• Used by Operations only
• Restricted
• Non-business process
• Lowest-common-denominator data serialization
• The entry point for ALL your data
[Diagram: Regional Data Treatment Facility – collection pools feed continuous and bulk ingest, with metadata capture, into the Ingest Zone, SOR Zone and Export Zone on the master compute cluster, under orchestration and monitoring with an operational metadata index.]
Make sure you capture metadata! Or you risk a swamp downstream.
Data Reservoir
• Used by Business AND Operations
• Marting
• Business processes
• DSS
• No ad hoc
• Business restricted
• First introduction of SMEs (subject-matter experts)
[Diagram: Regional Reservoir – lake master data, <LOB> Zone and Export Zone on the master compute cluster, with a business metadata index, MPP fast analytics, orchestration and monitoring, and processes to correlate, co-locate, cleanse and de-identify.]
Don’t let in un-vetted data!
Data Lab
• Used primarily by the business
• “Un-safe” data
• Ephemeral (think virtualization)
• Highly experimental
• New technologies
• Ad hoc
[Diagram: Regional Lab – virtual compute clusters, one per insight, drawing on lake master data.]
Data Lake Best Practices
• Know where you are headed – build on Roadmap or Optimizer Planning
• Quickly put reference practices for company-wide Data Lake ingest into use
• Establish data lineage and governance tracking with metadata services
• Establish standards and practices to scale out your data ingest
• Develop standards for profiling and discovery
• Build out a pipeline framework for data transformations (see the sketch below)
• Develop a security plan (perimeter, authentication and authorization)
• Develop an archive and information security approach
• Plan out next steps and an approach for discovery and reporting
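As a sketch of the "pipeline framework for data transformations" bullet, the toy framework below chains per-record steps and records lineage for each one; the step names and fields are illustrative, not a prescribed design.

```python
from datetime import datetime, timezone

def standardize_dates(record):
    record["order_date"] = record["order_date"].replace("/", "-")
    return record

def drop_pii(record):
    record.pop("ssn", None)              # de-identification step
    return record

PIPELINE = [standardize_dates, drop_pii]

def run_pipeline(record, steps=PIPELINE):
    """Apply each step in order and keep a lineage trail for governance tracking."""
    lineage = []
    for step in steps:
        record = step(record)
        lineage.append({"step": step.__name__,
                        "at": datetime.now(timezone.utc).isoformat()})
    return record, lineage

clean, lineage = run_pipeline({"order_date": "2015/04/07", "ssn": "000-00-0000"})
```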
Twitter Tag: #briefr The Briefing Room
Perceptions & Questions
Analyst: Robin Bloor
Robin Bloor, PhD
There Has Been a Clear Shift
Analytics & BI were previously EDW-centric
They are becoming Data Lake-centric
Hadoop vs Data Mgmt Engine

Hadoop (Data Lake)    | DBMS/EDW (Data workhorse)
Inexpensive (?)       | Expensive
Any data              | Prepared data
May have metadata     | Will have metadata
Poor performance      | Optimized performance
Weak scheduling       | Optimized scheduling
Weak data mgmt        | Good data mgmt
Security?             | Secure
Big Data Architecture - 1
Think Logical, Implement Physical
Big Data Architecture - 2
Big Data Architecture - 3
§ Multiple local instances of Hadoop
§ Weak data placement
§ Metadata chaos
§ Lack of tuning capability
§ Security (expense)
§ User self-service becoming a file system nightmare
Straws in the Wind
Operational Concerns
The Need for Best Practices
This is clear:
Data Lake is a new idea
• Is a data lake really just a multiplicity of data marts growing wild?
• Aside from performance-critical workloads, what should Hadoop not be used for?
• Do you have any specific recommendations for metadata management in a data lake?
• Is there a need for enforced provenance & lineage?
• Security question: Encryption?
• Where does streaming fit into the picture?
Twitter Tag: #briefr The Briefing Room
Upcoming Topics
www.insideanalysis.com
April: BIG DATA
May: CLOUD
June: INNOVATORS
Twitter Tag: #briefr The Briefing Room
THANK YOU for your ATTENTION!
Some images provided courtesy of Wikimedia Commons