Date posted: 14-Apr-2017
Category: Technology
Critical Breakthroughs and Technical Challenges in Big Data Driven Innovation
Paolo Spreafico
Head of EMEA Data Solution Engineers, Google Cloud Platform
Google Cloud Platform 2
Google’s Mission
“Organize the world’s information and make it universally accessible and useful.”
#cloudconf2016
By 2020, there will be 8 Billion connected smart phones, 2X more than today.
And 32 Billion connected “IoT” devices, 6X more than today.
Source: Boston Consulting Group, The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact; IDC, 2015
10x increase in data (4ZB to 45ZB)
35B connected devices
40% of data “touched” by the cloud
Source: IDC
Organisation, Data, Questions, Technology
Data is key (among others)
“Companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors.”
Andrew McAfee and Erik Brynjolfsson, MIT
What does Cloud 3.0 look like?
Cloud 1.0: Colocation
Your kit, someone else’s building. Yours to manage.
Cloud 2.0 (today's cloud): Virtualized Data Centers
Standard virtual kit, for rent. Still yours to manage.
Cloud 3.0: An actual, global elastic cloud
Invest your energy in great apps
Diagram labels: Storage, Processing, Memory, Network; Single-node computing; “Some assembly required”; True, on-demand cloud; Automation
Google Cloud Platform Vision
Messaging Big Data Containers NoSQL
http://googleasiapacific.blogspot.se/2015/06/growing-our-data-center-in-singapore.html
For the past 17 years, Google has been building out the fastest, most powerful, highest quality cloud infrastructure on the planet.
Edge locations in virtually every country in the world
Our Network
77 peering locations
10+ Years of Tackling Big Data Problems
Timeline figure (years marked: 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015)
Google Papers: GFS, MapReduce, BigTable, Dremel, PubSub, Flume Java, Millwheel
Open Source: Apache Beam, Tensorflow
Google Cloud Products: BigQuery, Pub/Sub, Dataflow, Bigtable
Google’s Data Services for everyone
Confidential + Proprietary
The Google Cloud data toolbox
Storage and Databases
Cloud Storage
Cloud SQL
Cloud Bigtable
Cloud Datastore
Big Data and Analytics
BigQuery
Cloud Pub/Sub
Cloud Dataflow
Cloud Dataproc
Cloud Datalab
Machine Learning
Cloud Machine Learning
Cloud Translate API
Cloud Vision API
Cloud Speech API
A serverless big data stack that scales automatically
Typical Big Data Processing
Time to Understanding
Complexities of Big Data Processing:
Programming
Resource provisioning
Performance tuning
Monitoring
Reliability
Deployment & configuration
Handling growing scale
Utilization improvements
Spend Time on ‘What’ not ‘How’
Time to Understanding
Big Data Processing with Google Cloud Platform
Programming
More time to dig into your data
Cloud 3.0 Big Data Lifecycle
Cloud Logs
Google App Engine
Google Analytics Premium
Cloud Pub/Sub
BigQuery Storage (tables)
Cloud Bigtable (NoSQL)
Cloud Storage (files)
Cloud Dataflow
BigQuery Analytics(SQL)
Capture Store Analyze
Batch
Process
Stream
Cloud Monitoring
Real-time analytics
Cloud Dataflow
Cloud ML
Real-time dashboard
Real-time alerts
Use
Data Scientists
Analysts
Smart apps
Catalog & Data Lifecycle Automation
Cloud Datalab
Cloud Dataproc
Data Studio
Emerging Big Data Challenges
Real-time data ingestion
Machine learning at scale
Batch or streaming?
Analytics at the speed of thought
Batch or Streaming? Why do you have to choose?
Breakthrough #1
“We don’t really use MapReduce anymore”
Urs Hölzle, SVP Technical Infrastructure, Google
A common configuration: capturing input
Cloud Pub/Sub: Reliable, many-to-many, asynchronous messaging
Cloud Storage: Powerful, simple and cost-effective object storage
Raw logs, files, assets, Google Analytics data, etc.
Events, metrics, etc.
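The "many-to-many, asynchronous" property can be illustrated with a toy in-memory model. This is only a sketch: the `Topic` class and its `subscribe`, `publish` and `pull` methods are hypothetical names invented for illustration and are not the Cloud Pub/Sub client API. It shows that any number of publishers can write to a topic, and that each subscription receives its own copy of every message and consumes it at its own pace.

```python
from collections import deque
from typing import Deque, Dict, List


class Topic:
    """Toy in-memory topic (hypothetical; not the Cloud Pub/Sub client)."""

    def __init__(self) -> None:
        # One independent queue per named subscription.
        self._subscriptions: Dict[str, Deque[str]] = {}

    def subscribe(self, name: str) -> None:
        self._subscriptions.setdefault(name, deque())

    def publish(self, message: str) -> None:
        # Fan out: every subscription gets its own copy of the message.
        for queue in self._subscriptions.values():
            queue.append(message)

    def pull(self, name: str, max_messages: int = 10) -> List[str]:
        # Each subscriber drains its own queue whenever it is ready.
        queue = self._subscriptions[name]
        batch: List[str] = []
        while queue and len(batch) < max_messages:
            batch.append(queue.popleft())
        return batch


topic = Topic()
topic.subscribe("dashboard")
topic.subscribe("archiver")
topic.publish("event-1")  # e.g. sent by one publisher
topic.publish("event-2")  # e.g. sent by another publisher

dashboard_batch = topic.pull("dashboard")
archiver_batch = topic.pull("archiver")
print(dashboard_batch, archiver_batch)
```

A managed service adds durability, acknowledgements and delivery across machines, none of which this sketch attempts.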
A common configuration: process and transform
Events, metrics, etc.
Cloud Dataflow: Data processing engine for batch and stream processing
Stream
Batch
Cloud Dataproc: Managed Spark and Hadoop
Batch
Raw logs, files, assets, Google Analytics data, etc.
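The "batch or streaming, why choose?" idea can be sketched in plain Python. This is not the Dataflow or Apache Beam API. The function `count_per_window`, the `Event` tuple and the 60 second window are assumptions made for illustration. The same windowed count runs unchanged over a bounded list (batch) and over a generator that yields events one at a time (stream). It assumes events arrive in timestamp order, whereas a real engine also has to deal with late and out-of-order data.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (event timestamp in seconds, key)


def count_per_window(
    events: Iterable[Event], window_s: float = 60.0
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_index, counts per key) for fixed, non-overlapping windows.

    Assumes events arrive in timestamp order (no late data).
    """
    current = None
    counts: Counter = Counter()
    for ts, key in events:
        window = int(ts // window_s)
        if current is not None and window != current:
            # The previous window is complete: emit it and start a new one.
            yield current, dict(counts)
            counts = Counter()
        current = window
        counts[key] += 1
    if current is not None:
        yield current, dict(counts)


batch = [(1.0, "click"), (30.0, "view"), (61.0, "click"), (62.0, "click")]


def stream() -> Iterator[Event]:
    # Stand-in for an unbounded source: events are produced one at a time.
    for event in batch:
        yield event


batch_result = list(count_per_window(batch))
stream_result = list(count_per_window(stream()))
print(batch_result)
```

Because the transform only depends on an iterable of events, the caller decides whether the input is a finished file or a live feed.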
A common configuration: analyze and store
Events, metrics, etc.
Stream
Batch
BigQuery: Extremely fast and cheap on-demand analytics engine
Bigtable: High performance NoSQL database for large workloads
Batch
Raw logs, files, assets, Google Analytics data, etc.
A common configuration: draw conclusions
Events, metrics, etc.
Stream
Batch
Raw logs, files, assets, Google Analytics data, etc.
Spreadsheets
BI Tools
Co-workers
Applications and Reports
Cloud Datalab
Visualization and BI
Real-time data ingestion (and at scale)
Breakthrough #2
Google confidential │ Do not distribute
Overview:
Data to process: data in the Consolidated Audit Trail (CAT), a data repository of all equities and options orders, quotes, and events
Challenges:
How to process the CAT and organize 100 billion market events into an “order lifecycle” in a 4 hour window
Store 6 years (~30PB) of data
Cloud Bigtable to process and run queries and tolerate volume increases
6 BILLION MARKET EVENTS WRITTEN PER HOUR
6 TB PER HOUR
1.7 GB PER SECOND
10 BN WRITTEN PER HOUR BURSTS
10 TERABYTES PER HOUR
1.7 GIGABYTES PER SECOND
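The ingestion figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes); the slide does not state which convention it uses. Under that assumption, 6 TB per hour works out to about 1.7 GB per second, which matches the slide. 10 TB per hour would be closer to 2.8 GB per second, so the repeated 1.7 GB figure appears to belong to the 6 TB line.

```python
SECONDS_PER_HOUR = 3600


def per_second(per_hour: float) -> float:
    """Convert an hourly rate into a per-second rate."""
    return per_hour / SECONDS_PER_HOUR


# Figures quoted on the slide (decimal units assumed).
events_per_second = per_second(6e9)              # 6 billion events per hour
gb_per_second_at_6tb = per_second(6e12) / 1e9    # 6 TB per hour
gb_per_second_at_10tb = per_second(10e12) / 1e9  # 10 TB per hour
avg_bytes_per_event = 6e12 / 6e9                 # implied average event size

# Rate needed to organize 100 billion events within a 4 hour window.
required_events_per_second = 100e9 / (4 * SECONDS_PER_HOUR)

print(f"{events_per_second:,.0f} events/s at 6 billion events/hour")
print(f"{gb_per_second_at_6tb:.2f} GB/s at 6 TB/hour")
print(f"{gb_per_second_at_10tb:.2f} GB/s at 10 TB/hour")
print(f"{avg_bytes_per_event:.0f} bytes per event on average")
print(f"{required_events_per_second:,.0f} events/s for 100 billion in 4 hours")
```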
https://www.youtube.com/watch?v=fqOpaCS117Q
Analytics at the speed of thought
(and at scale)
Breakthrough #3
Google BigQuery makes complex data analysis simple
Scales automatically
No setup or administration
Stream up to 100,000 rows per second
Easily integrates with third-party software
Google BigQuery Performance Example
Running an inefficient regular expression over 100 billion rows in less than 60 seconds
Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query
Before: 1000-core Hadoop Cluster = 2.5 hours
After: Making ad hoc queries with BigQuery < 5 min
● 500+ Games
● Hundreds of Analysts
● Terabytes of Data Daily
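The performance figures above translate into simple lower bounds. The sketch below only restates the slide's numbers as ratios. Because the slides say "less than 60 seconds" and "< 5 min", both results are minimums, not measurements.

```python
HOUR = 3600
MINUTE = 60


def speedup(before_s: float, after_s: float) -> float:
    """Ratio of the old duration to the new duration."""
    return before_s / after_s


# 1000-core Hadoop cluster (2.5 hours) vs. ad hoc BigQuery queries (< 5 min).
hadoop_vs_bigquery = speedup(2.5 * HOUR, 5 * MINUTE)

# 100 billion rows scanned in under 60 seconds implies at least this rate.
min_rows_per_second = 100e9 / 60

print(f"speedup: at least {hadoop_vs_bigquery:.0f}x")
print(f"scan rate: at least {min_rows_per_second / 1e9:.2f} billion rows/s")
```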
Google BigQuery
The Power of Google Dremel for everyone
Storage Compute
Fast Ingest Query
Terabit Network
“Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery, allowing us to react in real time to customer needs and provide better service.”
Gary Sanders, Head of the bank's digital analytics function
https://www.finextra.com/newsarticle/28566/lloyds-partners-google-on-data-analytics
Machine learning for everyone
Breakthrough #4
"Machine learning is a core, transformative way by which we're rethinking everything we're doing … we're thoughtfully applying it across all our products, be it search, ads, YouTube or Play."
Applications that can see, hear and understand
TensorFlow
Deep Learning technology currently powering over 100 Google services
Generalizable to vision, sound, text, video and other data
Runs on CPUs or GPUs, desktop, server, or mobile computing platforms.
Distributed via Apache 2.0 OSS license
Use your own data to train models
What Cloud Machine Learning Can Do
● Fully managed service
● Train using a custom TensorFlow graph
● Batch and online predictions, at scale
● Integrated Datalab experience
● Regression and classification tasks
Fully trained, easy to use Machine Learning models
Cloud Translate API
Cloud Vision API
Cloud Speech API
Cloud Vision API
Label Detection
Landmark Detection
OCR
Logo Detection
Face Detection
Explicit Content Detection
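Landmark Detection returns its result as JSON. As a minimal sketch using only Python's standard `json` module, the code below reads a response with the same fields as the Arc de Triomphe sample in this section. The helper `best_landmark` and its `min_score` cut-off are hypothetical and not part of the Vision API. No request is sent: the response text is embedded as a string.

```python
import json

# Sample with the same fields as the Arc de Triomphe response in this section.
SAMPLE_RESPONSE = """
{"landmarkAnnotations": [
  {"description": "Arc de Triomphe",
   "locations": [{"latLng": {"latitude": 48.873667, "longitude": 2.295134}}],
   "score": 0.94231218}]}
"""


def best_landmark(response_text: str, min_score: float = 0.5):
    """Return (description, latitude, longitude, score) for the top landmark.

    Returns None when no annotation reaches min_score (a hypothetical cut-off).
    """
    annotations = json.loads(response_text).get("landmarkAnnotations", [])
    candidates = [a for a in annotations if a.get("score", 0.0) >= min_score]
    if not candidates:
        return None
    top = max(candidates, key=lambda a: a["score"])
    lat_lng = top["locations"][0]["latLng"]
    return top["description"], lat_lng["latitude"], lat_lng["longitude"], top["score"]


result = best_landmark(SAMPLE_RESPONSE)
print(result)
```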
{"landmarkAnnotations": [
  {"description": "Arc de Triomphe",
   "locations": [{"latLng": {"latitude": 48.873667, "longitude": 2.295134}}],
   "score": 0.94231218}]}
Cloud Speech API
Recognizes over 80 languages and variants
Can return text in real-time
Highly accurate, even in noisy environments
Access from any device
Powered by Google’s machine learning
Speech API Demo
“What are you sinking about?”
https://www.google.com/intl/en/chrome/demos/speech.html
Machine Learning Use Cases
Structured Data
Classification/Regression
● Customer Churn Analysis
● Product Diagnostics
● Forecasting
Recommendation
● Content Personalization
● Product X-Sells/Up-sells
Anomaly Detection
● Fraud Detection
● Asset Sensor Diagnostics
● Log Metric Anomalies
Unstructured Data
Image Analytics
● Identify damaged shipments
● Explicit Content Classification
Text Analytics
● Call Center log analysis
● Language Identification
● Topic Classification
● Sentiment Analysis
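To make one of these use cases concrete, here is a deliberately simple baseline for "Log Metric Anomalies". It is a z-score check in plain Python, not Cloud Machine Learning or TensorFlow. The function name, the 2.5 threshold and the sample latencies are assumptions chosen for illustration. A trained model would normally replace this kind of rule once enough labelled data is available.

```python
from statistics import mean, stdev
from typing import List


def zscore_anomalies(values: List[float], threshold: float = 2.5) -> List[int]:
    """Return the indices of values more than `threshold` sample standard
    deviations away from the mean.

    Note: in a sample of n points no z-score can exceed (n - 1) / sqrt(n),
    so small samples need a threshold below the textbook value of 3.
    """
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical request latencies in milliseconds; the last one is a spike.
latencies_ms = [101, 99, 100, 102, 98, 100, 101, 99, 100, 500]
anomalies = zscore_anomalies(latencies_ms)
print(anomalies)
```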
cloud.google.com
Google’s Approach to Cloud Security & Compliance
● Tens of thousands of custom-built, homogeneous systems
● Dozens of datacenters for redundancy
● Data encryption in transit and at rest
● Secure software development process
● External security verifications
● 500+ security engineers
● 160+ academic research papers on security
● Vulnerability Reward Program
We store our own data in this environment
Certifications
Product    SSAE-16 SOC 1  SSAE-16 SOC 2  SSAE-16 SOC 3  ISO 27001  HIPAA (BAA)  PCI DSS v3.0  FISMA             FedRAMP
GAE        Complete       Complete       Complete       Complete   H2 15        Complete      FISMA (Moderate)  H2 15
GCS        Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
GCE        Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Datastore  Complete       Complete       Complete       Complete   H2 15        Complete      n/a               H2 15
BigQuery   Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Cloud SQL  Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Genomics   Complete       Complete       Complete       Complete   Complete     n/a           n/a               H2 15
Apps       Complete       Complete       Complete       Complete   Complete     n/a           GAFG only         H2 15
https://cloud.google.com/solutions/machine-learning-with-financial-time-series-data
Demo: Predicting the NYSE daily outcome
Get more info: Google Cloud for Financial Services
https://cloud.google.com/solutions/finserv/