Date post: | 16-Apr-2017 |
Category: |
Technology |
Upload: | hug-france |
Data Wrangling on Hadoop with Spark
Paris Spark Meetup, 03/04/17
Victor Coustenoble, Technical Regional Manager
[email protected] @vizanalytics
DATA WRANGLING
QUESTION, ANALYZE, INSIGHT
DISCOVER, STRUCTURE, CLEANSE, ENRICH, VALIDATE, PUBLISH
What is Data Wrangling?
DATA
Company Overview
Background
➔ Headquartered in San Francisco, with offices in Boston, London, Berlin and Paris
➔ More than 100 employees
➔ Created in 2012
Focus
➔ 100% focused on Data Wrangling and Data Preparation
➔ Accelerate time to value and business use of Big Data
➔ Visual, interactive and Self-Service Data Preparation
Business System Data Machine Generated Data Third Party Data
Reporting / BI
Data Visualization
LOB IT
Explore Structure Clean Enrich Validate Publish
Distributed Data Platform
Predictive Analytics / Data Science
Machine Data / Enterprise Processes
Applications / processes
Reporting / Data driven decision
Recommendations / Data Mining
Self-service access to raw data for business analysts, operated under IT control
INTERACTIVE & VISUAL
PREDICTIVE & SUGGESTIONS
INTEROPERABLE
Trifacta Key Differentiators
Interoperable: Reduces Total Cost of Ownership
Interoperability with metadata repositories enables discoverability and lineage for compliance & audit
Interoperability with existing security models avoids having to administer another app
*Predictive Interaction for Data Transformation – Heer, Hellerstein & Kandel; Stanford University & University of California, Berkeley (2015)
Intelligent Execution* keeps Trifacta highly performant, both now and in the future
Execution Architecture
Optimized processing for data not needing parallel processing
[Figure: execution latency (Immediate, Interactive, Batch) versus data volume (MBs, GBs, TBs, PBs), with regions labeled In-memory, Intelligent Execution and Future Technologies]
Intelligent Execution Architecture
Automatically selects the right execution engine for the data set being transformed
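As a rough illustration of choosing an execution engine from the size of the data set, a minimal Python sketch follows. The 1 GiB cut-off, the engine labels and the function name `select_engine` are all hypothetical; the slides do not describe the actual selection logic.

```python
# Hypothetical sketch: the threshold and engine labels are illustrative
# assumptions, not Trifacta's actual selection rules.
IN_MEMORY_LIMIT_BYTES = 1 * 1024**3  # assumed cut-off: 1 GiB


def select_engine(dataset_size_bytes: int) -> str:
    """Return a label for the engine that would run the transformation."""
    if dataset_size_bytes < 0:
        raise ValueError("dataset size cannot be negative")
    # Small data sets do not need parallel processing.
    if dataset_size_bytes <= IN_MEMORY_LIMIT_BYTES:
        return "in-memory"
    # Larger data sets are sent to the distributed engine.
    return "spark-on-yarn"


if __name__ == "__main__":
    for size in (50 * 1024**2, 200 * 1024**3):
        print(size, "->", select_engine(size))
```

In practice such a decision would likely weigh more than raw size (for example the kind of transformation), so the single threshold here is only a stand-in.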
TRIFACTA
Trifacta Workflow in Hadoop
Sample Scale Up
Refine Sample
Results
Identify/Register Data
1. Predictive Interaction
2. Consume
Schedulers
Monitor and Adjust
3. Schedule
Visualization & Analysis
Secure Access: Kerberos, LDAP…
CLI
How Does Trifacta’s Spark Work?
§ YARN resource manager
§ Cluster deployment mode
Trifacta runs its own version of Spark in "cluster deployment mode", using the Hadoop cluster's YARN resource manager.
§ Trifacta's Spark job lives in its own YARN container, separate from other Spark jobs running on the same cluster.
Trifacta submits the following to YARN for execution across the cluster:
§ Spark v2.1.0 libraries
§ Trifacta Transformation & Profiling libraries
§ Transformation logic (DAG)
§ Libraries are distributed & cached by YARN after the initial load.
Spark job parameters (can be set per user):
§ Executor parameters (memory size, number of vcores).
§ Dynamic allocation (enabled by default), so the number of executors varies with the YARN resources available.
§ Jobs can be assigned to a specific YARN queue.
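The parameters listed above map onto standard Spark-on-YARN settings. The Python sketch below assembles an equivalent `spark-submit` command line, as one way to picture them. The `spark-submit` flags and property names are standard Spark ones, but the function name `build_spark_submit`, the jar name, the queue name and the values are illustrative assumptions, and this is not necessarily how Trifacta itself submits jobs.

```python
# Sketch only: flags and property names are standard Spark settings, but
# the jar, queue and values used below are illustrative assumptions.
from typing import Dict, List


def build_spark_submit(app_jar: str, executor_memory: str,
                       executor_cores: int, queue: str) -> List[str]:
    """Assemble a spark-submit command for YARN cluster deploy mode."""
    conf: Dict[str, str] = {
        # Executor sizing (memory size, number of vcores).
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
        # Let Spark grow and shrink the number of executors.
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation on YARN relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
    }
    cmd = ["spark-submit", "--master", "yarn",
           "--deploy-mode", "cluster", "--queue", queue]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    return cmd


if __name__ == "__main__":
    # Hypothetical jar and queue names, for illustration only.
    print(" ".join(build_spark_submit("wrangle-job.jar", "4g", 2, "wrangling")))
```

Per-user settings could then be expressed as different arguments to the same function, for example a larger executor memory or a dedicated queue for a given team.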
Trifacta Selected as OEM Partner for Google Cloud Dataprep Service
Trifacta Interface & Photon Engine Integrated within Google Cloud Ecosystem
● Access & publish data from/to Google Cloud Storage & BigQuery
● Compile recipes to Google Cloud Dataflow for fully-managed auto-scaling execution
[Diagram: Google Cloud Dataprep. INPUT (Cloud Storage, BigQuery) to Cloud Dataprep, executed on Dataflow, to OUTPUT (Cloud Storage, BigQuery)]
https://cloud.google.com/dataprep/
Storage
3rd Party: Experian, Nielsen, FICO…
IT
LOB
Discovering Structuring Cleaning Enriching Validating Publishing
Ingestion Processing
DATA LAKE
Demonstration: Predict and Avoid Churn – Customer 360
Customer Data
Account Activity
Social Media
CRM: Contact / Status
Voice: Text Data
Tweets: Handles
ANALYSIS & VISUALIZATION
Trifacta: The Global Leader in Data Wrangling
No. 1 by Analysts
#1 End User Data Preparation Vendor (2015)
Leader in Forrester Wave for Data Preparation Tools (2017)
No. 1 by Users
No. 1 by Customers
No. 1 by Partners
[Chart: scale 0 to 50 000; timeline Oct 2015, Oct 2016, Oct 2017; year labels 2016, 2017]
Thank you. Questions?
Download Trifacta Wrangler: trifacta.com/start-wrangling
[email protected] @vizanalytics