Date post: | 21-Jan-2017 |
Category: | Technology |
Upload: | julius-remigio-cbip |
© 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential 1
Alex Garbarini, Data Lake Service Owner
§ What is a Data Lake in Today’s Climate?
§ Starting the Data Lake Journey
§ Data Management Options
§ Automated Data Ingestion Pipelines
§ File System Layout and Security
§ Enterprise Processes
§ Cisco Operational Use Cases
• Data Lake - a place to store practically unlimited amounts of data of any format, schema and type that is relatively inexpensive and massively scalable. Data processing software like Hadoop can transform the data from its raw state to a finished product.
~ Revelytix
• If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
~ Pentaho
Data Lake
Data Reservoir
Data Swamp
Data Ponds
“Tread carefully, you must, or the DARK side of the swamp you will find.”
Can you taste the rainbow… of problems?
[Diagram: Hadoop Platform with multiple separate App/Data pairs]
Initial Data Lake Objectives:
• Eliminate Silos & Enable Data Reuse
• Optimize Data Ingestion from Source Systems
• Metadata Management
• Data on Tap
• Provide All (Useful, Enterprise only) Data on One Platform
Key Decisions:
• “Build In-House” or “Buy”
• “Data Lake” or “Data Reservoir”
• “Self Serviced” or “Managed”
Best Choice: It Depends!
Key Decisions and Our Choice:
• “Build In-House” or “Buy”: Build In-House
• “Data Lake” or “Data Reservoir”: Data Lake
• “Self Serviced” or “Managed”: Managed
• This translates to “Data on Tap.”
• “Automated Data Ingestion Pipeline” sounds fancy, but it just means creating easy ways for new data to be incorporated into the Lake.
• Build for the most common data sources: Relational, File System, Streaming, Web Service…
• Employ best practices (e.g. security, governance, compliance, impact assessment…)
Typical Process: Design, Develop, Implement, Audit, Normalize
Automated Process: Add an Entry to your Metadata, Get Coffee
• Cisco solves this problem with an automated ingest engine driven from a metadata repository:
[Diagram: Data Sources, Metadata Repo, Scheduler, Hadoop Platform]
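A minimal sketch of what a metadata-driven ingest engine could look like. This is not Cisco's implementation: the `SourceEntry` fields, the handler names, the example paths, and the interval-based scheduling rule are all assumptions made for illustration. The idea it shows is that each source is one metadata entry, and a scheduler picks the handler from that entry instead of from hand-written pipeline code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SourceEntry:
    """One row in the (hypothetical) metadata repository."""
    name: str
    source_type: str       # "relational", "file_system", "streaming", "web_service"
    target_path: str       # destination directory on the Hadoop platform (made up)
    interval_minutes: int  # how often the scheduler should ingest this source


def _describe(kind: str) -> Callable[[SourceEntry], str]:
    """Build a stand-in handler; a real one would move data, not return text."""
    def handler(entry: SourceEntry) -> str:
        return f"{kind} ingest: {entry.name} -> {entry.target_path}"
    return handler


# One handler per common source type named on the slides.
HANDLERS: Dict[str, Callable[[SourceEntry], str]] = {
    kind: _describe(kind)
    for kind in ("relational", "file_system", "streaming", "web_service")
}


def due_entries(repo: List[SourceEntry], minute: int) -> List[SourceEntry]:
    """Return the entries whose interval divides the current minute counter."""
    return [e for e in repo if minute % e.interval_minutes == 0]


def run_cycle(repo: List[SourceEntry], minute: int) -> List[str]:
    """One scheduler tick: dispatch every due entry to its handler."""
    return [HANDLERS[e.source_type](e) for e in due_entries(repo, minute)]


# Hypothetical repository contents.
REPO = [
    SourceEntry("orders_db", "relational", "/lake/sales/orders", 60),
    SourceEntry("clickstream", "streaming", "/lake/web/clicks", 15),
]

if __name__ == "__main__":
    for line in run_cycle(REPO, 60):
        print(line)
```

Under these assumptions, onboarding a new source means adding one `SourceEntry` to the repository; the scheduler and handlers stay unchanged.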
• Control and segregate the ingested data by domain and access.
  • Security by design is better, but not always realistic or necessary to the same degree.
  • Keep PII/SOX/non-owned data under separate restrictions.
• Understand the purpose of the data being imported, or it will need to move!
  • (Most common cause of a lake turning into a swamp)
• Store data in the format in which it will be consumed.
• You can’t please everyone, so don’t compromise the implementation only to please no one.
• Understand the security required with impact assessments.
[Diagram: Hadoop Platform with multiple Apps and a single Data Lake]
[Diagram: file system layout. Labels: Enterprise Data; Supply Chain; Services; Reference; Sales; Channels; Public; Restricted; Pre-Sales; Post-Sales; …; Internal Data; External Data; Facebook; Projects]
• Unix Level Control:
  • Data access groups for each restricted, final level
  • Mode assignment for restricted groups
• Simple Metadata Driven Access Definitions
Public: RWX R-X R-X
Restricted: RWX R-X ---
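The two mode rows above correspond to the octal Unix modes 755 and 750. As a rough sketch of a metadata-driven access definition, the mapping below turns an access level into a mode and applies it to a directory. The `ACCESS_MODES` table and the function names are assumptions for illustration. It runs against a local temporary directory rather than HDFS, and it does not assign the data access group, which a real deployment would also need to do.

```python
import os
import stat
import tempfile
from typing import Dict

# Hypothetical access definitions: access level -> Unix mode (owner, group, other).
ACCESS_MODES: Dict[str, int] = {
    "public": 0o755,      # RWX R-X R-X
    "restricted": 0o750,  # RWX R-X ---
}


def mode_string(level: str) -> str:
    """Render the directory mode for an access level, e.g. 'drwxr-x---'."""
    return stat.filemode(stat.S_IFDIR | ACCESS_MODES[level])


def apply_access(path: str, level: str) -> None:
    """Set the mode on an existing directory according to its access level."""
    os.chmod(path, ACCESS_MODES[level])


if __name__ == "__main__":
    # Local demonstration only; directory names are made up.
    with tempfile.TemporaryDirectory() as root:
        for level in ACCESS_MODES:
            path = os.path.join(root, level)
            os.mkdir(path)
            apply_access(path, level)
            print(level, stat.filemode(os.stat(path).st_mode))
```

Keeping the level-to-mode table in metadata means a new restricted area only needs a new entry and a group, not new permission logic.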
Subscription
Data Catalogs
Self-Service
Automation
Self Healing
Thank You