Date posted: 17-Jul-2015
Category: Technology
Uploaded by: qubole
Today’s speakers
Yekesa Kosuru
VP of Engineering,
DataXu
Ashish Dubey
Solutions Architect,
Qubole
Scott Ward
Solutions Architect,
AWS
Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
Housekeeping
• The recording link will be distributed to all registrants via email after
the webinar next week
• Please submit your questions and comments using the Chat with
Presenters box located at the bottom left corner of your screen
Creating Value from Data Assets
• Recommendations, Collective Intelligence: Machine Learning
• Visualization: Dashboards, Business Intelligence
• Measuring Functionality and Services: Ad Hoc Queries, A/B Testing
• Hypothesis Testing & Predictions: Statistical Analysis
• Learning from Social Media Conversations: Sentiment Analysis
Big Data | AWS Cloud
Potentially massive data sets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of infrastructure deployment/usage
Frequently not a steady-state workload; peaks and valleys | Efficient with highly variable workloads
Time to results is key | Parallel compute clusters from single data source
Hard to configure/manage | Managed services for data storage and analysis
Big Data + AWS
AWS Data Services
• Data
– Velocity: Millisecond, Second, Minute, Hour, Day
– Variety: Structured, Unstructured, Text, Binary
– Volume: Gigabytes, Terabytes, Petabytes
• EC2, EBS: Instance Storage
• Redshift, RDS: SQL Stores
• EMR: Hadoop
• DynamoDB: NoSQL
• Kinesis: Stream
• S3, CloudFront, Glacier: Storage Services
• ElastiCache: Caching
• Data Pipeline: Orchestrate
Amazon Elastic MapReduce: Hosted Hadoop Framework
• Easy to use and fully managed
• Secure
• Resizable clusters to support processing needs
• Support for EC2 spot instances
• Use many query tools to support analysis of
your data
– Hive, Pig, HBase, Spark, BI tools, etc.
• EMRFS for an S3-backed data store
• Direct integration with other AWS data stores
– S3, Redshift, DynamoDB
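The resizable-cluster and Spot bullets above can be sketched as the instance-group layout that boto3's EMR `run_job_flow` call accepts in its `Instances` argument. This is a minimal sketch and not a recommended configuration: the helper name, instance type, node counts, and bid price are all illustrative assumptions, and nothing here contacts AWS.

```python
def build_instance_groups(core_count, task_count, spot_bid_usd,
                          instance_type="m3.xlarge"):
    """Return an InstanceGroups list: on-demand master/core, Spot task nodes.

    All values are placeholders chosen for illustration (assumption).
    """
    if core_count < 1 or task_count < 0:
        raise ValueError("need at least one core node and a non-negative task count")

    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so losing a Spot node costs only compute.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": f"{spot_bid_usd:.3f}",  # the API expects a string
             "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


groups = build_instance_groups(core_count=4, task_count=8, spot_bid_usd=0.10)
```

The resulting list could be passed as `Instances={"InstanceGroups": groups}` to `run_job_flow`. Keeping core nodes on demand and putting only task nodes on Spot is one common choice, not something the slides prescribe.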
Amazon EMR Architecture
[Diagram labels: Master instance group, Core instance group, Task instance group, HDFS, Amazon S3, Amazon Redshift, Amazon DynamoDB]
EMR Security
• Security groups for master and
slave instances
• Instances launch in your VPC
• Encrypt data in S3
• Control who can access S3 data
• API requests must be signed with a key
Amazon Redshift: Petabyte-Scale Data Warehouse
• Fully managed data warehouse solution
• Scales to petabytes at $1,000 per TB per year
• Integrates with existing data warehouse
tools
• Scales through columnar storage and
parallel query execution
• Loads data directly from S3
• Integration with Amazon EMR
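Loading directly from S3 is done with Redshift's `COPY` command. The sketch below only assembles the SQL text; the helper name, table name, bucket, and IAM role ARN are hypothetical, and the pipe delimiter and gzip compression are assumed file properties, not details from the slides.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, delimiter="|", gzip=True):
    """Assemble a Redshift COPY statement that loads files under an S3 prefix.

    Plain string formatting is used for brevity, so pass trusted values only.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")

    parts = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"DELIMITER '{delimiter}'",
    ]
    if gzip:
        parts.append("GZIP")  # input files are gzip-compressed (assumption)
    return "\n".join(parts) + ";"


statement = build_copy_statement(
    table="events",                                   # hypothetical table
    s3_uri="s3://example-bucket/events/",             # hypothetical prefix
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

The statement would then be run over a normal JDBC/ODBC or PostgreSQL-protocol connection to the cluster's leader node. Pointing `COPY` at a prefix that holds several files lets the compute nodes load them in parallel.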
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB, Amazon EMR, Amazon S3, HDFS/SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Diagram labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in
Amazon S3 encrypted
– HSM/CloudHSM
• No direct access to compute
nodes
• Amazon VPC support
[Diagram labels: Customer VPC, Internal, Security Group, JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]
Amazon Redshift Security
2014 Usage Statistics for Qubole on AWS:
• Total QCUH processed in 2014 = 40.6 million
• Total nodes managed in 2014 = 2.5 million
• Total PB processed in 2014 = 519
• Operations Analyst
• Marketing Ops Analyst
• Data Architects
• Business Users
• Product Support
• Customer Support
• Developer
• Sales Ops
• Product Managers
Developer Tools
Service Management
Data Workbench
Cloud Data Platform
BI & DW Systems
• SDK
• API
• Analysis
• Security
• Job Scheduler
• Data Governance
• Analytics templates
• Monitoring
• Support
• Collaboration
• Workflow & Map/Reduce
• Auto Scaling
• Cloud Optimization
• Data Connectors
• YARN
• Presto & Hive
• Spark & Pig
Hadoop Ecosystem (Apache Open Source)
DataXu Introduction
Disruptive on-demand software platform relied upon by the world’s
leading brands
A petabyte scale marketing cloud that enables Fortune 500 brands to
manage data, insight and action to maximize Marketing ROI
The industry’s #1 rated programmatic marketing technology
spun out of MIT by the founders
One of the fastest growing companies in the Inc. 500
DataXu Quick Statistics
Big data + Real time decisions
Big Data Processing
• 13 petabytes of data
• 20 terabytes/day consumer data intake
Real-Time Decisioning
• 42 billion decisions per second
• 1,500,000 inbound queries per second
• Dozens of algorithms across mobile, social, native, display, video and TV
Predictive Modeling
• Executing 10,000+ investments simultaneously
• 10M variables considered per investment decision using next-gen machine learning
Enterprise-Cloud Infrastructure
• 14 data centers
• 35,000+ CPU cores
Patent portfolio for real-time decision systems
Exclusive license from MIT to Algebra Of Systems IPR
Programmatic buying exploits real time signals to
drive greater ROI.
• Analyze: analyze the attributes available at bidding time (context, geo, O.S., time, demo, etc.)
• Appraise: assess the value of each impression to determine a bid price and the creative to serve
• Optimize: learn from served impressions to adjust future bidding and creative delivery
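The analyze, appraise, optimize cycle on this slide can be illustrated with a toy bidder. This is a sketch under heavy assumptions and is not DataXu's algorithm: the class, the segment key (geo plus O.S.), the smoothed conversion-rate estimate, and every number are invented for illustration.

```python
from collections import defaultdict


class ToyBidder:
    """Toy analyze/appraise/optimize loop (illustrative only)."""

    def __init__(self, value_per_conversion, prior_rate=0.01, prior_weight=100):
        self.value_per_conversion = value_per_conversion
        self.prior_rate = prior_rate      # assumed baseline conversion rate
        self.prior_weight = prior_weight  # pseudo-impressions backing the prior
        self.served = defaultdict(int)
        self.converted = defaultdict(int)

    def analyze(self, request):
        # Reduce the attributes available at bid time to a segment key.
        return (request["geo"], request["os"])

    def appraise(self, request):
        # Bid = smoothed conversion-rate estimate for the segment * value.
        segment = self.analyze(request)
        rate = ((self.converted[segment] + self.prior_rate * self.prior_weight)
                / (self.served[segment] + self.prior_weight))
        return rate * self.value_per_conversion

    def optimize(self, request, did_convert):
        # Learn from a served impression so later bids reflect the outcome.
        segment = self.analyze(request)
        self.served[segment] += 1
        if did_convert:
            self.converted[segment] += 1


bidder = ToyBidder(value_per_conversion=50.0)
request = {"geo": "US", "os": "iOS"}
bid_before = bidder.appraise(request)
bidder.optimize(request, did_convert=True)
bid_after = bidder.appraise(request)
```

After one converting impression the estimated rate for that segment rises, so `bid_after` exceeds `bid_before`. A production system would use far richer features and models; the slides mention 10M variables per decision.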
• On-premise and Cloud
• Why Cloud/AWS
– Automation, API driven
– All Data in One Place
– Improved Testability
– Deep Security
– Breadth and Depth of Services
– Costs, Pay As You Go
– Auto Scaling (Scalability, Elasticity)
– Disaster Recovery and Business Continuity
DataXu in the Cloud
AWS
DataXu Data Flows in AWS
[Diagram: data flows through four stages: Producers, Continuous Processing, Storage, Analytics. Labeled components: CDN, Real Time Bidding, Retargeting Platform, Qubole, Kinesis, S3, Redshift, Machine Learning, Streaming Data Collection. Labeled users: Analysts, Data Scientists, Engineers.]
Why Qubole
Managed Service
• Auto Scaling
• Spot Pricing
• No Opex
• Redundant Clusters
• Data Security
Single Unified Interface
• Rich Unified Experience
• Data Discovery tool
• Query Templates
• Administration and Monitoring
Performance Optimizations
• Overall better performance than other
Hadoop clusters in the cloud
Automation
• Workflow, Scheduler
• SDK
Support
• 24x7 deep-expertise support
Unified Experience
• Operations Analyst
• Marketing Ops Analyst
• Data Architect
• Business Users
• Product Support
• Customer Support
• Developer
• Sales Ops
• Product Managers
Ease of use for anyone
• Use a VPC; pick AZs that match your instance reservations
• Use hybrid spot pricing strategy
• Use tags for better reporting
• Seek Qubole help for cluster tuning
Qubole Cluster Best Practices
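A hybrid spot pricing strategy means running part of the cluster on demand and the rest on Spot. The sketch below estimates the node split and blended hourly cost for a given Spot share. The function name and all prices are hypothetical, and it does not use Qubole's API.

```python
import math


def plan_hybrid_cluster(total_nodes, spot_fraction, on_demand_price, spot_price):
    """Split nodes between on-demand and Spot and estimate hourly cost.

    Prices are per node-hour in USD and are caller-supplied assumptions.
    """
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")

    # Round Spot nodes down so the stable on-demand share is never undercut.
    spot_nodes = math.floor(total_nodes * spot_fraction)
    on_demand_nodes = total_nodes - spot_nodes
    hourly_cost = on_demand_nodes * on_demand_price + spot_nodes * spot_price
    return {
        "on_demand_nodes": on_demand_nodes,
        "spot_nodes": spot_nodes,
        "hourly_cost": round(hourly_cost, 4),
    }


plan = plan_hybrid_cluster(total_nodes=20, spot_fraction=0.5,
                           on_demand_price=0.28, spot_price=0.07)
```

With these made-up prices, half the nodes run on Spot and the blended cost is well below an all-on-demand cluster, while the on-demand half keeps the cluster alive if Spot capacity is reclaimed.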
Data Security & Privacy
• AWS offers comprehensive data security
• Security & Privacy
– VPC
– IAM Policies, Users, Roles
– S3 Buckets, Bucket Policies & HTTPS
– Security Groups, Whitelist IP CIDR
– Key Management Service & CloudHSM
– Server Side and Client Side Encryption
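Two of the controls listed above, HTTPS-only access and an IP CIDR whitelist, can be expressed together in an S3 bucket policy. The sketch below builds such a policy document with the standard `aws:SecureTransport` and `aws:SourceIp` condition keys. The helper name, bucket name, and CIDR range are hypothetical, and the policy is illustrative, not a description of DataXu's setup.

```python
import ipaddress
import json


def build_bucket_policy(bucket, allowed_cidrs):
    """Build a bucket policy denying non-HTTPS and non-whitelisted requests."""
    for cidr in allowed_cidrs:
        ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR

    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Reject any request that did not arrive over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {   # Reject requests from outside the whitelisted ranges.
                "Sid": "DenyOutsideWhitelist",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"NotIpAddress": {"aws:SourceIp": list(allowed_cidrs)}},
            },
        ],
    }


policy = build_bucket_policy("example-data-bucket", ["203.0.113.0/24"])
policy_json = json.dumps(policy, indent=2)
```

A blanket `aws:SourceIp` deny can also block legitimate access paths such as VPC endpoints, so a policy like this should be tested on a non-production bucket first.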
Right tool for right workload
Use Case / Technology
• Large-scale ETL
• Interactive discovery queries
• Machine learning / real-time queries
• High-performance DW queries / reporting backend
Questions?
DataXu
Yekesa Kosuru
www.dataxu.com
Qubole
Ashish Dubey
www.qubole.com
AWS
Scott Ward
aws.amazon.com