The Data Lake – A New Solution to Business Intelligence
Agenda
• Cas Apanowicz – An Introduction
• A Little History
• Traditional DW/BI
• What is a Data Lake?
• Why is it better?
• Architectural Reference
• New Paradigm and Architectural Reference
• Future of the Data Lake
• Q&A
• Appendix A
Cas Apanowicz
• Cas is the founder and was the first CEO of Infobright – the first open-source data-warehouse company, co-owned by Sun Microsystems and RBC Royal Bank of Canada.
• He is an accomplished IT consultant and entrepreneur and an internationally recognized IT practitioner who has served as a co-chair and speaker at international conferences.
• Prior to Infobright, Mr. Apanowicz founded Cognitron Technology, which specialized in developing data-mining tools, many of which were used in the health-care field to assist in customer care and treatment.
• Before Cognitron Technology, Mr. Apanowicz worked in the Research Centre at BCTel, where he developed an algorithm that measured customer satisfaction. At the same time, he worked at the Brain Center at UBC in Vancouver, applying ground-breaking algorithms to the interpretation of brain readings, and offered his expertise to Vancouver General Hospital in applying new technology to the recognition of different types of epilepsy.
• Cas Apanowicz has been designing and delivering BI/DW technology solutions for over 18 years. He has created an open-source BI/DW software company and holds North American patents in this field.
• Throughout his career, Cas has held consulting roles with Fortune 500 companies across North America, including the Royal Bank of Canada, the New York Stock Exchange, the Federal Government of Canada, Honda, and many others.
• Cas holds a Master's degree in Mathematics from the University of Krakow.
• Cas is an author of North American patents and of several publications with renowned publishers such as Springer and Sherbrooke Hospital. He is also regularly invited by Springer to peer-review IT-related publications.
A Little History
Big Data has received much attention over the past two years, with some calling it Ugly Data.

The challenge is dealing with the “mountains of sand” – hundreds, thousands, and in some cases millions of small, medium, and large data sets which are related but unintegrated.

IT is overtaxed and unable to integrate the vast majority of this data. A new class of software is needed to discover relationships between related yet unintegrated data sets.
Current BI

Extensive processes and costs:
• Data Analysis
• Data Cleansing
• Entity-Relationship Modeling
• Dimensional Modeling
• Database Design & Implementation
• Database Population through ETL/ELT
• Downstream Application Linkage – Metadata
• Maintaining the Processes

[Diagram labels: Cloud; Source Data]
BI and Hadoop

[Diagram labels: Data Marts; Analytical Database (×5)]
BI Reference Architecture

[Diagram: the BI reference architecture. From top to bottom:
• Access – Web Browser, Portals, Devices (e.g. mobile), Web Services, Collaboration
• Business Applications / Analytics – Query & Reporting, Data Mining, Modeling, Scorecard, Visualization, Embedded Analytics
• Data Repositories – Operational Data Stores, Data Warehouse, Data Marts, Staging Areas, Metadata
• Data Integration – Extraction, Transformation, Load/Apply, Synchronization, Transport/Messaging, Information Integrity
• Data Sources – Enterprise, Unstructured, Informational, External (Supplier, Orders, Product, Promotions, Customer, Location, Invoice, ePOS, Other)
Cross-cutting layers: Metadata Management; Security and Data Privacy; System Management and Administration; Network Connectivity, Protocols & Access Middleware; Hardware & Software Platforms.]

[Hadoop overlay – the Data Lake: HDFS and Analytical Data Marts replace the Data Repositories; HCatalog provides metadata management; Sqoop handles extraction; MapReduce/PIG performs Load/Apply from a single source. HCatalog & Pig can work with most ETL tools on the market.]
Transport / Messaging
HCatalog – Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Extraction is an application used to transfer data, usually from relational databases to a flat file, which can then be transported to a landing area of a Data Warehouse and ingested into the BI/DW environment.
BI Reference Architecture – Extraction
Sqoop – A command-line application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Exports can be used to move data from Hadoop into a relational database.
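Sqoop itself is driven from the command line, but the extract pattern it automates – pulling rows from a relational table into a delimited flat file – can be sketched in a few lines of Python. The following sketch uses an in-memory SQLite database as a stand-in source; the table and column names are illustrative, not taken from any real system.

```python
import csv
import sqlite3

def extract_to_flat_file(conn, table, out_path, delimiter="|"):
    """Dump every row of `table` to a delimited flat file, Sqoop-style."""
    cur = conn.execute(f"SELECT * FROM {table}")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)  # one delimited line per source row

# Stand-in source database (illustrative schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 15.5)])
extract_to_flat_file(conn, "orders", "orders.txt")
```

The resulting flat file is what would then be moved by SFTP (current BI) or replaced entirely by a direct Sqoop transfer (proposed BI).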
[Diagram – Extraction, current vs. proposed: Current BI performs a database extract from Source and moves the file by sftp to the Target; Proposed BI transfers Source to Target directly with Sqoop.]
MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.
Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for executing these programs.
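As a concept sketch only (not Hadoop code), the map-shuffle-reduce pattern that both MapReduce and Pig ultimately execute can be illustrated with a word count in plain Python:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data", "data lake", "data warehouse"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
# counts["data"] == 3
```

In real deployments the map and reduce steps run in parallel across the cluster, and Pig Latin compiles down to such jobs so analysts never write them by hand.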
Transformation

[Diagram – Transformation, current vs. proposed: Current BI moves data from Landing through Staging to the DW and on to the Data Marts (DM) via several complex ETL steps; Proposed BI feeds the DMs straight from HDFS with MapReduce/Pig.]

Load / Apply

[Diagram – Load/Apply, current vs. proposed: Current BI loads Staging, then the DW, then each DM through complex ETL; Proposed BI loads the DMs directly from HDFS.]
Synchronization

Synchronization – The ETL process takes source data from staging, transforms it using business rules, and loads it into the central DW repository. In this scenario, to retain information integrity, one has to put in place a synchronization check-and-correction mechanism.
HDFS as a Single Source – In the proposed solution, HDFS acts as a single source of data, so there is no danger of desynchronization. Inconsistencies resulting from duplicated or inconsistent data will be reconciled with the assistance of HCatalog and proper data governance.
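The reconciliation step described above can be pictured as keeping, for each business key, the most recently landed record. A minimal Python sketch of that idea follows; the field names (`id`, `load_ts`, `amount`) are illustrative, not part of any actual HCatalog schema.

```python
def reconcile(records, key="id", version="load_ts"):
    """Collapse duplicated records to the most recent version per key."""
    latest = {}
    for rec in records:
        k = rec[key]
        # Keep the record with the highest version (load timestamp) per key.
        if k not in latest or rec[version] > latest[k][version]:
            latest[k] = rec
    return list(latest.values())

landed = [
    {"id": 1, "load_ts": 1, "amount": 10},
    {"id": 1, "load_ts": 2, "amount": 12},  # later duplicate wins
    {"id": 2, "load_ts": 1, "amount": 7},
]
clean = reconcile(landed)
```

In practice this logic would live in the MapReduce/Pig jobs that feed the Data Marts, with HCatalog supplying the key and timestamp column names.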
[Diagram – Synchronization, current vs. proposed: Current BI must keep Source, Landing, Staging, DW, and DM synchronized; Proposed BI has only Source, HDFS, and DM, with HDFS as the single source.]
Information Integrity
Current – Currently there is no special approach to data quality other than what is embedded in the ETL processes and logic. There are tools and approaches to implement QA & QC.

Hadoop – A more focused approach: while we use HDFS as one big “Data Lake”, QA and QC will be applied at the Data Mart level, where the actual transformations occur, hence reducing the overall effort. QA & QC will be an integral part of Data Governance, augmented by the use of HCatalog.
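A data-mart-level quality gate of the kind described might, for example, check for missing fields and duplicate keys just before load. The Python sketch below is illustrative only; the rule set and field names are assumptions, not part of the proposed solution.

```python
def qc_checks(rows, key="id", required=("id", "amount")):
    """Return a list of QC findings for a batch of data-mart rows."""
    findings = []
    seen = set()
    for i, row in enumerate(rows):
        # Completeness check: every required field must be populated.
        for field in required:
            if row.get(field) is None:
                findings.append(f"row {i}: missing {field}")
        # Uniqueness check: the business key must not repeat in the batch.
        k = row.get(key)
        if k in seen:
            findings.append(f"row {i}: duplicate key {k}")
        seen.add(k)
    return findings

batch = [{"id": 1, "amount": 10}, {"id": 1, "amount": 11}, {"id": 2, "amount": None}]
issues = qc_checks(batch)
```

Running such checks only where transformations actually happen (the Data Marts) is what reduces the overall QA & QC effort relative to ETL-embedded checks at every stage.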
Data Repositories

[Diagram: the traditional Data Repositories (Operational Data Stores, Data Warehouse, Data Marts, Staging Areas, Metadata) are replaced by HDFS with HCatalog.]
HCatalog Metadata Management
HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
BI Reference Architecture
Hadoop Distributed File System (HDFS) – A reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
[The reference-architecture diagram is repeated with the Hadoop overlay: HDFS and Analytical Data Marts as the Data Repositories, HCatalog for metadata management, Sqoop for extraction, and MapReduce/PIG for Load/Apply from a single source; for Data Integration, HCatalog & Pig can work with Informatica.]
BI Reference Architecture

Capability | Current BI | Proposed BI | Expected Change
Data Sources | Source applications | Source applications | No change
Data Integration:
Extraction from source DB | Export | Sqoop | One-to-one change
Transport/Messaging | SFTP | SFTP | No change
Staging-area transformations/load | Complex ETL code | None required | Eliminated
Extract from Staging | Complex ETL code | None required | Eliminated
Transformation for DW | Complex ETL code | None required | Eliminated
Load to DW | Complex ETL, RDBMS | None required | Eliminated
Extract from DW, transform and load to DM | Complex ETL code & process to feed the DM | MapReduce/Pig: simplified transformations from HDFS to DM | Yes
Data Quality, Balance & Controls | Embedded ETL code | MapReduce/Pig in conjunction with HCatalog; can also coexist with Informatica | Yes
BI Reference Architecture

Capability | Current BI | Proposed BI | Expected Change
Data Repositories:
Operational Data Stores | Additional data store (currently sharing resources with BI/DW) | No additional repository; BI consumption implemented through the appropriate DM | Additional data store eliminated
Data Warehouse | Complex schema on an expensive platform; requires complex modeling and design for any new data element | Eliminated; all data is collected in HDFS and available to feed all required Data Marts (DM) – no schema-on-write | Eliminated
Staging Areas | Complex schema on an expensive platform; requires complex design for any new data element | Eliminated; all data is collected in HDFS and available for creating Data Marts | Eliminated
Data Marts | Dimensional schema | Dimensional schema | No change
BI Reference Architecture

Capability | Current BI | Proposed BI | Expected Change
Metadata | Not implemented | HCatalog | Simplified, due to simplified processing and a native metadata-management system
Security | Mature enterprise | Mature enterprise, guaranteed by the Cloud provider | Less maintenance
Analytics | WebFOCUS, MicroStrategy, Pentaho, SSRS, etc. | WebFOCUS, MicroStrategy, Pentaho, SSRS, etc. | No change
Access | Web, mobile, other | Web, mobile, other | No change
Business Case

The client had an internally developed BI component strategically positioned in its BI ecosystem. Cas Apanowicz of IT Horizon Corp. was retained to evaluate the solution. The Data Lake approach was recommended, resulting in a total saving of $778,000 and a shortening of the implementation time from 6 to 2 months:

Solution Component | Traditional/Original | Proposed DW Discovery
Implementation Time | 6 months | 2 months
Cost of Implementation | $975,000 | $197,000
Number of Resources Involved in Implementation | 17 | 4
Estimated Maintenance Cost | $195,000 | $25,000