The Data Lake – A New Solution to Business Intelligence
Agenda
• Cas Apanowicz – An Introduction
• A Little History
• Traditional DW/BI
• What is a Data Lake?
• Why is it better?
• Architectural Reference
• New Paradigm and Architectural Reference
• Future of the Data Lake
• Q&A
• Appendix A
Cas Apanowicz
• Cas is the founder and was the first CEO of Infobright – the first open-source data-warehouse company, co-owned by Sun Microsystems and RBC Royal Bank of Canada.
• He is an accomplished IT consultant and entrepreneur and an internationally recognized IT practitioner who has served as a co-chair and speaker at international conferences.
• Prior to Infobright, Mr. Apanowicz founded Cognitron Technology, which specialized in developing data-mining tools, many of which were used in the health-care field to assist in customer care and treatment.
• Before Cognitron Technology, Mr. Apanowicz worked in the Research Centre at BCTel, where he developed an algorithm that measured customer satisfaction. At the same time, he worked at the Brain Center at UBC in Vancouver, applying ground-breaking algorithms to the interpretation of brain readings, and offered his expertise to Vancouver General Hospital in applying new technology to the recognition of different types of epilepsy.
• Cas Apanowicz has been designing and delivering BI/DW technology solutions for over 18 years. He has created an open-source BI/DW software company and holds North American patents in this field.
• Throughout his career, Cas has held consulting roles with Fortune 500 companies across North America, including the Royal Bank of Canada, the New York Stock Exchange, the Federal Government of Canada, Honda, and many others.
• Cas holds a Master's degree in Mathematics from the University of Krakow.
• Cas is an author of North American patents and of several publications with renowned publishers such as Springer and Sherbrooke Hospital. He is also regularly invited by Springer to peer-review IT-related publications.
A Little History
Big Data has received much attention over the past two years, with some calling it Ugly Data.

The challenge is dealing with the “mountains of sand” – hundreds, thousands, and in some cases millions of small, medium, and large data sets which are related but unintegrated.

IT is overtaxed and unable to integrate the vast majority of this data. A new class of software is needed to discover relationships between related yet unintegrated data sets.
Current BI

Extensive processes and costs:
• Data Analysis
• Data Cleansing
• Entity-Relationship Modeling
• Dimensional Modeling
• Database Design & Implementation
• Database Population through ETL/ELT
• Downstream Application Linkage – Metadata
• Maintaining the Processes

[Diagram labels: Cloud; Source Data]
BI and Hadoop

[Diagram labels: Data Marts; Analytical Database (×5)]
BI Reference Architecture

[Diagram: the BI reference architecture. From top to bottom:
• Access – Web Browser, Portals, Devices (e.g. mobile), Web Services, Collaboration
• Business Applications / Analytics – Query & Reporting, Data Mining, Modeling, Scorecard, Visualization, Embedded Analytics
• Data Repositories – Operational Data Stores, Data Warehouse, Data Marts, Staging Areas, Metadata
• Data Integration – Extraction, Transformation, Load/Apply, Synchronization, Transport/Messaging, Information Integrity
• Data Sources – Enterprise, Unstructured, Informational, External (Supplier, Orders, Product, Promotions, Customer, Location, Invoice, ePOS, Other)
Cross-cutting layers: Metadata Management; Security and Data Privacy; System Management and Administration; Network Connectivity, Protocols & Access Middleware; Hardware & Software Platforms.]

[Hadoop overlay – the Data Lake: HDFS and Analytical Data Marts replace the Data Repositories; HCatalog provides metadata management; Sqoop handles extraction; MapReduce/PIG performs Load/Apply from a single source. HCatalog & Pig can work with most ETL tools on the market.]
Transport / Messaging
HCatalog – Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Extraction is an application used to transfer data, usually from relational databases to a flat file, which can then be transported to a landing area of a Data Warehouse and ingested into the BI/DW environment.
BI Reference Architecture – Extraction
Sqoop – A command-line application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Exports can be used to move data from Hadoop into a relational database.
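Sqoop itself is driven from the command line, but the extract pattern it automates – pulling rows from a relational table into a delimited flat file – can be sketched in a few lines of Python. The following sketch uses an in-memory SQLite database as a stand-in source; the table and column names are illustrative, not taken from any real system.

```python
import csv
import sqlite3

def extract_to_flat_file(conn, table, out_path, delimiter="|"):
    """Dump every row of `table` to a delimited flat file, Sqoop-style."""
    cur = conn.execute(f"SELECT * FROM {table}")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)  # one delimited line per source row

# Stand-in source database (illustrative schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 15.5)])
extract_to_flat_file(conn, "orders", "orders.txt")
```

The resulting flat file is what would then be moved by SFTP (current BI) or replaced entirely by a direct Sqoop transfer (proposed BI).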
[Diagram – Extraction, current vs. proposed: Current BI performs a database extract from Source and moves the file by sftp to the Target; Proposed BI transfers Source to Target directly with Sqoop.]
MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.
Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for executing these programs.
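As a concept sketch only (not Hadoop code), the map-shuffle-reduce pattern that both MapReduce and Pig ultimately execute can be illustrated with a word count in plain Python:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data", "data lake", "data warehouse"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
# counts["data"] == 3
```

In real deployments the map and reduce steps run in parallel across the cluster, and Pig Latin compiles down to such jobs so analysts never write them by hand.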
Transformation

[Diagram – Transformation, current vs. proposed: Current BI moves data from Landing through Staging to the DW and on to the Data Marts (DM) via several complex ETL steps; Proposed BI feeds the DMs straight from HDFS with MapReduce/Pig.]

Load / Apply

[Diagram – Load/Apply, current vs. proposed: Current BI loads Staging, then the DW, then each DM through complex ETL; Proposed BI loads the DMs directly from HDFS.]
Synchronization

Synchronization – The ETL process takes source data from staging, transforms it using business rules, and loads it into the central DW repository. In this scenario, to retain information integrity, one has to put in place a synchronization check-and-correction mechanism.
HDFS as a Single Source – In the proposed solution, HDFS acts as a single source of data, so there is no danger of desynchronization. Inconsistencies resulting from duplicated or inconsistent data will be reconciled with the assistance of HCatalog and proper data governance.
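The reconciliation step described above can be pictured as keeping, for each business key, the most recently landed record. A minimal Python sketch of that idea follows; the field names (`id`, `load_ts`, `amount`) are illustrative, not part of any actual HCatalog schema.

```python
def reconcile(records, key="id", version="load_ts"):
    """Collapse duplicated records to the most recent version per key."""
    latest = {}
    for rec in records:
        k = rec[key]
        # Keep the record with the highest version (load timestamp) per key.
        if k not in latest or rec[version] > latest[k][version]:
            latest[k] = rec
    return list(latest.values())

landed = [
    {"id": 1, "load_ts": 1, "amount": 10},
    {"id": 1, "load_ts": 2, "amount": 12},  # later duplicate wins
    {"id": 2, "load_ts": 1, "amount": 7},
]
clean = reconcile(landed)
```

In practice this logic would live in the MapReduce/Pig jobs that feed the Data Marts, with HCatalog supplying the key and timestamp column names.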
[Diagram – Synchronization, current vs. proposed: Current BI must keep Source, Landing, Staging, DW, and DM synchronized; Proposed BI has only Source, HDFS, and DM, with HDFS as the single source.]
Information Integrity
Current – Currently there is no special approach to data quality other than what is embedded in the ETL processes and logic. There are tools and approaches to implement QA & QC.

Hadoop – A more focused approach: while we use HDFS as one big “Data Lake”, QA and QC will be applied at the Data Mart level, where the actual transformations occur, hence reducing the overall effort. QA & QC will be an integral part of Data Governance, augmented by the use of HCatalog.
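A data-mart-level quality gate of the kind described might, for example, check for missing fields and duplicate keys just before load. The Python sketch below is illustrative only; the rule set and field names are assumptions, not part of the proposed solution.

```python
def qc_checks(rows, key="id", required=("id", "amount")):
    """Return a list of QC findings for a batch of data-mart rows."""
    findings = []
    seen = set()
    for i, row in enumerate(rows):
        # Completeness check: every required field must be populated.
        for field in required:
            if row.get(field) is None:
                findings.append(f"row {i}: missing {field}")
        # Uniqueness check: the business key must not repeat in the batch.
        k = row.get(key)
        if k in seen:
            findings.append(f"row {i}: duplicate key {k}")
        seen.add(k)
    return findings

batch = [{"id": 1, "amount": 10}, {"id": 1, "amount": 11}, {"id": 2, "amount": None}]
issues = qc_checks(batch)
```

Running such checks only where transformations actually happen (the Data Marts) is what reduces the overall QA & QC effort relative to ETL-embedded checks at every stage.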
Data Repositories

[Diagram: the traditional Data Repositories (Operational Data Stores, Data Warehouse, Data Marts, Staging Areas, Metadata) are replaced by HDFS with HCatalog.]
HCatalog Metadata Management
HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
BI Reference Architecture
Hadoop Distributed File System (HDFS) – A reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
[The reference-architecture diagram is repeated with the Hadoop overlay: HDFS and Analytical Data Marts as the Data Repositories, HCatalog for metadata management, Sqoop for extraction, and MapReduce/PIG for Load/Apply from a single source; for Data Integration, HCatalog & Pig can work with Informatica.]
BI Reference Architecture

Capability | Current BI | Proposed BI | Expected Change
Data Sources | Source applications | Source applications | No change
Data Integration:
Extraction from source DB | Export | Sqoop | One-to-one change
Transport/Messaging | SFTP | SFTP | No change
Staging-area transformations/load | Complex ETL code | None required | Eliminated
Extract from Staging | Complex ETL code | None required | Eliminated
Transformation for DW | Complex ETL code | None required | Eliminated
Load to DW | Complex ETL, RDBMS | None required | Eliminated
Extract from DW, transform and load to DM | Complex ETL code & process to feed the DM | MapReduce/Pig: simplified transformations from HDFS to DM | Yes
Data Quality, Balance & Controls | Embedded ETL code | MapReduce/Pig in conjunction with HCatalog; can also coexist with Informatica | Yes
BI Reference Architecture

Capability | Current BI | Proposed BI | Expected Change
Data Repositories:
Operational Data Stores | Additional data store (currently sharing resources with BI/DW) | No additional repository; BI consumption implemented through the appropriate DM | Additional data store eliminated
Data Warehouse | Complex schema on an expensive platform; requires complex modeling and design for any new data element | Eliminated; all data is collected in HDFS and available to feed all required Data Marts (DM) – no schema-on-write | Eliminated
Staging Areas | Complex schema on an expensive platform; requires complex design for any new data element | Eliminated; all data is collected in HDFS and available for creating Data Marts | Eliminated
Data Marts | Dimensional schema | Dimensional schema | No change
BI Reference Architecture

Capability | Current BI | Proposed BI | Expected Change
Metadata | Not implemented | HCatalog | Simplified, due to simplified processing and a native metadata-management system
Security | Mature enterprise | Mature enterprise, guaranteed by the Cloud provider | Less maintenance
Analytics | WebFOCUS, MicroStrategy, Pentaho, SSRS, etc. | WebFOCUS, MicroStrategy, Pentaho, SSRS, etc. | No change
Access | Web, mobile, other | Web, mobile, other | No change
Business Case

The client had an internally developed BI component strategically positioned in its BI ecosystem. Cas Apanowicz of IT Horizon Corp. was retained to evaluate the solution. The Data Lake approach was recommended, resulting in a total saving of $778,000 and a shortening of the implementation time from 6 to 2 months:

Solution Component | Traditional/Original | Proposed DW Discovery
Implementation Time | 6 months | 2 months
Cost of Implementation | $975,000 | $197,000
Number of Resources Involved in Implementation | 17 | 4
Estimated Maintenance Cost | $195,000 | $25,000