Date post: | 15-May-2015 |
Category: |
Technology |
Upload: | datameer |
© 2012 Datameer, Inc. All rights reserved.
BI, Hive or Big Data Analytics?
View the Recording of these Slides!
You can view the full recording of this on-demand webinar with slides at:
http://info.datameer.com/Slideshare-BI-Hive-Big-Data-Analytics.html
About our Speaker
Todd Nash

Todd is a founding Principal at CBIG Consulting, a professional services firm that helps clients leverage their data assets to produce timely, effective business strategies and tactical decisions. Todd leads CBIG’s eastern region consulting practice in the development, implementation, and execution of business intelligence and Big Data methodologies, cloud-based analytics strategies, and complex data warehousing solutions.

Todd graduated from Clemson University with a Bachelor of Science degree in Management Information Systems.
About our Speaker
Eduardo Rosas

Eduardo Rosas is Vice President of Services at Datameer and brings over 12 years of software implementation experience to the table.

In this role, Eduardo is focused on delivering a repeatable, high-quality level of service and support to help clients achieve their goals.

Prior to Datameer, Eduardo spent 11 years at Trintech, where he focused on managing a team of Technical Consultants and implementing global Java web-based solutions. Eduardo is originally from San Jose, CA and graduated from Santa Clara University.
Agenda
• Problem Statement – Business & Technical
• POC Technical Solution – High-level and Detailed
• Results
• Lessons Learned
Copyright © 2013 CBIG Consulting
PROBLEM STATEMENT
Business Problem Statement
[Funnel diagram: SEARCH → IMPRESSIONS → CLICK-THRU → LEAD]

Breadth:
• Searches to Impressions to Click-Thru to Leads
• Website optimization
• Customer optimization & upgrades
• Market optimization

Depth:
• Can the search criteria be optimized?
• Conversion of impressions based on refinement of search?
• Which product mix of impressions gets the greatest click-thru?
• What is the impact of amenities on leads?
• What additional features get used to convert to leads?

A real estate .com business makes money in two ways:
1. Property owners advertise properties
2. Ancillary businesses advertise services
This site needs the analytics to show customers the return on their investment.
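The search → impressions → click-thru → lead funnel above can be expressed as stage-to-stage conversion rates. The following is a minimal Python sketch of that arithmetic, not the analysis used in the POC; the function name `funnel_conversion` and all counts are hypothetical and chosen only for illustration.

```python
def funnel_conversion(stage_counts):
    """Return (from_stage, to_stage, rate) for each adjacent pair of stages."""
    rates = []
    for (prev_name, prev_n), (name, n) in zip(stage_counts, stage_counts[1:]):
        # Guard against an empty upstream stage to avoid dividing by zero.
        rate = n / prev_n if prev_n else 0.0
        rates.append((prev_name, name, rate))
    return rates


# Hypothetical counts, for illustration only (not figures from the POC).
stages = [
    ("search", 1000),
    ("impressions", 800),
    ("click-thru", 200),
    ("lead", 50),
]

for src, dst, rate in funnel_conversion(stages):
    print(f"{src} -> {dst}: {rate:.1%}")
```

Grouping the same counts by search criteria, product mix, or amenities would give the "depth" views listed on the slide.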
Technical Problem Statement
• Search & Impressions volume too large to build cube and provide deep analytics
• This has a negative impact on all reporting and performance of the entire system
• The business is unable to determine the value of all the data; has requests to add more
• Evaluating options to increase environment or look for alternatives
• POC to evaluate how Hadoop, Amazon cloud and Datameer could support challenge
[Architecture diagram: sources (Lookup Data, Master Data, Search, Web Activity) → Data Movement → ODS → Data Movement → EDW → Data Movement → Search Cube, Sales Cube, Marketing Cube, Service. Additional labels: Search & Impression.]
[Architecture diagram, second version: sources (Lookup Data, Master Data, Search, Web Activity) → Data Movement → ODS → Data Movement → EDW → Data Movement → Sales Cube, Marketing Cube, Service. No Search Cube is shown. Additional labels: Search & Impression, Search.]
Problem Statement – Success Criteria
Objective: To prove that the Hadoop architecture is an excellent option for the business to interact with large data and find datasets and relationships that require deeper analytics.

Original Scope & Goals:
• Bring in one year's worth of data from 6 tables into the Amazon Cloud Hadoop environment.
• IT resources will be able to extract the data from these tables and load them into .CSV files.
• The success criteria for this stream of work will be:
  ✓ Amazon Hadoop cloud environment & account is set up.
  ✓ Search Analytics data is loaded into the Amazon Hadoop cloud.
  ✓ Business is able to execute and perform analytics on Search Analytics data that is stored in Hadoop with acceptable performance.
  ✓ Gain analytical insights with the new solution.
POC TECHNICAL SOLUTION
POC Technical Solution – High Level
[Diagram: Web Activity History and Lookup Data → Amazon Web Services (Cloud): AWS S3 → AWS EMR (Hadoop) → Datameer (Data Discovery) → Web Portal (Widget-Based UI)]
POC Technical Solution – Detailed
[Diagram: source tables → Data Movement → Amazon Cloud S3 → Data Movement → Hadoop workbooks.
Activity tables: WebVisit, WebSearch, WebLead, Web Impressions, WebClicks, AllLeads.
Other tables and lookups: Phone Leads, Other Leads, LR Apts IMPS, Generic Activity, EmailLeads, Affiliate, Container Type, Email Type, Event Type, Lead Type, SearchType, Property List, Product ID, PhoneType, PageType, Site SubSite.
Data Workbooks: AllLeads, WebClicks, Web Impressions, WebLeads, WebSearch, WebVisits.
Use Case Workbooks: Use Case 1, Use Case 2.
Also shown: Additional Data Workbooks, Additional Use Cases.]
RESULTS
POC Results
Success Criteria | Results
Hadoop, Amazon, Datameer environment setup | Environment set up within the first couple of days
Able to load 1 year's worth of data (nearly 1.3 TB) | Loaded significantly more data than planned, for more robust analytics
Business able to execute and perform analytics | Business leveraged Datameer to execute use cases; executed ~20 additional use cases without IT help
Users provided acceptable performance | Queries executed to completion. Some took seconds, some took minutes and some required overnight.
Gain new insights | First time able to run these analytics. Found patterns and relationships contrary to assumptions. Will be updating service offerings & marketing plans because of the POC.
LESSONS LEARNED
Lessons Learned
GETTING DATA TO HADOOP
• Hadoop is file structure: finding the right delimiter
• Integrating data: requires ETL
• Data cleansing can be big: several iterations required

PEOPLE
• Remember change mgmt: education on new methods & tools

HADOOP
• Hadoop is batch: answers one thing at a time
• Analytics: move to database w/ tools

CLOUD
• Cloud flexible: easy setup and scaling
• Performance & sizing: sizing the cloud is challenging
• Cost for performance: TBs with support becomes costly
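On "finding the right delimiter": because the data lands in Hadoop as plain files, a delimiter that also occurs inside a field value silently shifts columns. One way to choose a delimiter, sketched below in Python, is to scan a sample of the raw values and take the first candidate that never appears in them. This is an illustration only, not the procedure used in the POC; `pick_delimiter`, the candidate list and the sample rows are all hypothetical. The candidate "\x01" (Ctrl-A) is included because it is Hive's default field delimiter.

```python
def pick_delimiter(rows, candidates=("|", "\t", "\x01", ",")):
    """Return the first candidate delimiter that appears in no field value."""
    for delim in candidates:
        if not any(delim in str(field) for row in rows for field in row):
            return delim
    raise ValueError("every candidate delimiter occurs in the data")


# Hypothetical sample rows; free-text fields often contain commas and pipes.
sample = [
    ("2 bed, 1 bath", "Austin, TX", 1200),
    ("studio | loft", "Boston, MA", 1500),
]

delim = pick_delimiter(sample)
# Write each row as one delimited line using the delimiter that was chosen.
lines = [delim.join(str(field) for field in row) for row in sample]
```

With this sample, both the comma and the pipe occur inside values, so the tab is chosen and every line splits back into exactly three fields.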
So what about open source tools like Hive?
Hive…

Prerequisites
• Must have data in Hadoop
• The data must be CLEAN
• Schema must be applied to the data by creating a Hive table

Goal of Hive
• Eases the complexity of writing MapReduce jobs by giving the technical user a set of tools that are more familiar, via SQL

Who can use Hive?
• SQL users can pick up HQL basics fairly quickly
What is Hive really good at?
• Hive is good in environments where we have clean, prepared data that doesn’t change often and is already in Hadoop
• Resembles a language that many IT folks are already familiar with
• Hive can help a user trying to identify a reporting trend
• User-defined functions (UDFs) can be used to reuse logic
-- Start of Hive script

-- Create a TEMP housing table
CREATE EXTERNAL TABLE MY_TABLE (
  num_ods string, num_bus_id int, um_ctry_cd int, prod_id string,
  rng_svc_cd string, rng6 string, bin string, bin_bus_id_enr int,
  bin_ctry_cd int, cd_fmt_a_2 string, cd_enr string, rsn_us_ind string,
  x_bus_id int, flg_enr string, my_dt string, user_id string,
  mthd_cd_enr string, tran_seq_id string, cd_enr2 string, us_amt string,
  moto_cd string, fee_curr_cd int, fee_desc_num string, fee_sgn_amt string,
  us_fee_sgn_amt string, mkt_spec string, catg_cd int, city_enr string,
  ctry_cd_enr int, dba_id int, nm_dscrptr string, geo_id int,
  geo_phone_num string, tier_cd string, msa string, nrmlzd_id int,
  pstl_cd string, b_st_cd_enr string, b_store_id string, b_vrfcn_val string,
  ntwrk_id int, site string, entry_mode_cd string, term_cpbty_cd string,
  sub_typ_cd string, dt string, id_num_enr int, prod_num int,
  prod_ppd_sub_typ_cd string, prod_typ_cd_enr string, prod_typ_ext_enr string,
  promo_cd string, promo_typ string, rwds_pgm_id_enr string, tran_cd string,
  tran_gmt_dt string, tran_gmt_tm string, tran_id string,
  unfrzn_acct_num_bus_id_enr int, unfrzn_arn_bin_bus_id_enr int,
  usage_cd_enr string, Other_amt string, curr_cd int
)
COMMENT "THIS IS MY TEMP TABLE";

-- Insert data into MY_TABLE
INSERT OVERWRITE TABLE MY_TABLE
SELECT *,
  SUM(us_tran_amt) AS SALES_VOL,
  SUM(US_FEE_SGN_AMT) AS US_FEE_SGN_AMT,
  COUNT(*) AS TRAN_COUNT,
  MIN(ACTIVE_DT) AS FIRST_ACTIVE_DT,
  MAX(SEARCH_DT) AS LAST_SEARCH_DT,
  MAX(customer_biz_id) AS customer_biz_id,
  MAX(PGM_ID_ENR) AS PGM_ID_ENR,
  MAX(CUST_PROD_ID) AS CUST_PROD_ID,
  MAX(POD_ID_NUM_ENR) AS POD_ID_NUM_ENR,
  MAX(PROD_TYPE) AS PROD_TYPE,
  MAX(SUB_TYPE) AS SUB_TYPE,
  1 AS ID
FROM MY_TABLE
WHERE dt LIKE '2012%'
GROUP BY customer_biz_id, PGM_ID_ENR, CUST_PROD_ID, eci_moto_cd, catg_cd,
  city_enr, ctry_cd_enr, pstl_cd, pod_id, prod_num, SUB_TYPE;

-- Create a TEMP lookup table
CREATE EXTERNAL TABLE MY_LOOKUP (
  acct_num bigint, acct_sta_cd string, acct_zip_cd string,
  rwrd_pgm_id string, pgm_ref_cd string, acct_prod_id string,
  bus_id int, bin int, status string, pgm_eff_dt string, dt string
)
COMMENT "THIS IS TEMP LOOKUP TABLE";

-- Insert data into it
INSERT OVERWRITE TABLE MY_LOOKUP
SELECT *, 1 AS cmf_ind
FROM LOOKUP
WHERE DT = '201211';

-- Do a full outer join
SELECT *
FROM MY_TABLE mt
FULL OUTER JOIN MY_LOOKUP ml
  ON mt.member_id = ml.member_id;
Some troubles
• No way to get data into Hadoop
• No data validation / may throw data away
• Security
• Sharing code across teams is a challenge
• No visualization
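On the data validation point: Hive applies the table schema when the data is read, and a value that does not fit its column type comes back as NULL rather than raising an error, which is one way data gets quietly thrown away. A pre-load check can surface such rows first. The Python sketch below is one possible shape for that check, offered as an assumption rather than anything from the webinar; `validate_rows` and the sample rows are hypothetical, and the column names are borrowed from the Hive script above.

```python
def validate_rows(rows, schema):
    """Split rows into (good, rejected) using a schema of (name, caster) pairs."""
    good, rejected = [], []
    for row in rows:
        if len(row) != len(schema):
            rejected.append((row, "wrong column count"))
            continue
        try:
            # Cast every value with its column's caster; a bad value raises.
            good.append(tuple(cast(value) for (_, cast), value in zip(schema, row)))
        except ValueError as exc:
            rejected.append((row, str(exc)))
    return good, rejected


# Hypothetical schema and rows, for illustration only.
schema = [("prod_id", str), ("catg_cd", int), ("us_amt", float)]
rows = [
    ["A1", "7", "19.99"],     # clean row
    ["B2", "seven", "5.00"],  # catg_cd is not an integer
    ["C3", "9"],              # missing a column
]

good, rejected = validate_rows(rows, schema)
print(len(good), "rows accepted;", len(rejected), "rows rejected")
```

Keeping the rejected rows together with a reason, instead of dropping them, makes each cleansing iteration easier to review.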
… but it’s free, right?

"Time to create Hive":
• Any machine-generated data (or anything semi/unstructured) must first be parsed by writing MapReduce or Pig/Python programs. Time-to-market disadvantage.
• Table definition is a manual effort (though this can be made easier by 3rd-party tools).

"Time to maintain Hive":
• Hive data models (tables) are most likely static, shared objects maintained and controlled by a few people who own the schema.
• Hive is also more of a black box for new employees coming in (so employee churn creates more maintenance effort).

Cost to implement Hive:
• This is mostly down to the human capital (expensive developers), and don't forget the prerequisite cost of implementing the data ingestion stage of the pipeline (populating the warehouse by writing MapReduce programs or other programs parsing/loading the data).
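To make the "table definition is a manual effort" point concrete, the Python sketch below generates a Hive CREATE EXTERNAL TABLE statement from a list of column names and types, which is roughly what a helper tool would automate. It is an illustration under stated assumptions, not a description of any particular third-party tool; `hive_ddl`, the table name `web_search` and its columns are hypothetical.

```python
def hive_ddl(table, columns, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for a delimited text file."""
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE;"
    )


# Hypothetical table and columns, for illustration only.
ddl = hive_ddl(
    "web_search",
    [("search_id", "string"), ("site", "string"), ("result_count", "int")],
)
print(ddl)
```

Someone still has to decide every column's name and type, and the statement has to be regenerated whenever the source files change, which is the maintenance cost the slide describes.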
Business decision

Do I train my engineers on a language, or eliminate the need for this by taking the problem directly to the business user?
So what would my Hive resource need to know?
• Hive QL (a different dialect from ANSI-standard SQL)
• MapReduce TUNING parameters (to name a few):
  • Data block size
  • Number of mappers/reducers
  • Compression at map output level; result compression; what codec to use
  • io.sort.factor
• Access to Hive is mainly done via the command-line interface
How does Datameer do it differently?
Questions and Answers
Online Resources
• Try Datameer: www.datameer.com
• Follow us on Twitter @datameer