Date post: | 06-Jan-2017 |
Category: |
Technology |
Upload: | james-serra |
Big Data: It’s all about the use cases
James Serra
Big Data Evangelist
[email protected]
About Me
• Business Intelligence Consultant, in IT for 30 years
• Microsoft, Big Data Evangelist
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference and PASS Summit
• MCSE: Data Platform and Business Intelligence
• MS: Architecting Microsoft Azure Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”
Topics
• Popular Technologies
• Use Cases (theory)
• Use Cases (practice)
Popular Technologies
Topics
What is Big Data?
Harness the growing and changing nature of data
• Structured
• Unstructured
• Streaming
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
Get the right information to the right people at the right time in the right format
What is the Internet of Things?
Things, Connectivity, Data, Analytics
IoT = sensor-acquired data
Using a Data Lake Modern Architecture
All data sources are considered
Leverages the power of on-prem technologies and the cloud for storage and capture
Native formats, streaming data, big data
Extract and load, no/minimal transform
Storage of data in near-native format
Orchestration becomes possible
Streaming data accommodation becomes possible
Refineries transform data on read
Produce curated data sets to integrate with traditional warehouses
Users discover published data sets/services using familiar tools
DATA SOURCES: CRM, ERP, OLTP, LOB
FUTURE DATA SOURCES
NON-RELATIONAL DATA
EXTRACT AND LOAD
DATA LAKE
DATA REFINERY PROCESS (TRANSFORM ON READ): Transform relevant data into data sets
BI AND ANALYTICS: Discover and consume predictive analytics, data sets and other reports
DATA WAREHOUSE: Star schemas, views, other read-optimized structures
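The "extract and load, transform on read" idea can be illustrated with a minimal Python sketch. The record layout, the `RAW_LAKE` list, and the `refine` function are hypothetical names chosen for illustration; they are not part of any product mentioned in the deck.

```python
import json

# Raw events are loaded into the lake untouched (extract and load, no transform).
# The field names here are assumptions made for this example.
RAW_LAKE = [
    '{"device": "meter-1", "reading": "21.5", "ts": "2017-01-06T10:00:00"}',
    '{"device": "meter-2", "reading": "n/a", "ts": "2017-01-06T10:00:05"}',
    '{"device": "meter-1", "reading": "22.0", "ts": "2017-01-06T10:01:00"}',
]


def refine(raw_lines):
    """Apply a schema while reading: parse, type, and skip unusable rows."""
    curated = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            reading = float(record["reading"])
        except ValueError:
            # The bad value stays in the lake but is left out of the curated set.
            continue
        curated.append(
            {"device": record["device"], "reading": reading, "ts": record["ts"]}
        )
    return curated


curated = refine(RAW_LAKE)
print(curated)
```

The raw lines are never modified, so a different refinery can later read the same data with a different schema.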
What is Hadoop?
Distributed, scalable system on commodity HW
Composed of a few parts:
• HDFS – Distributed file system
• MapReduce – Programming model
• Other tools: Hive, Pig, SQOOP, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm
Main players are Hortonworks, Cloudera, MapR
WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)
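To make the MapReduce programming model concrete, here is a single-process Python sketch of a word count. It only imitates the map, shuffle, and reduce phases in memory; real Hadoop jobs distribute these phases across the cluster, and the function names below are illustrative rather than Hadoop APIs.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is all data", "data lake"])
print(counts["data"])  # 3
```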
[Diagram: Hadoop core services, operational services, and data services. Components shown: HDFS, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon, Oozie, Ambari, Sqoop, Flume, NFS, WebHDFS (load & extract). The Hadoop cluster is drawn as many compute & storage nodes.]
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
Can I use the cloud with my DW?
• Public and private cloud
• Cloud-born data vs on-prem born data
• Transfer cost from/to cloud and on-prem
• Sensitive data on-prem, non-sensitive in cloud
• Look at hybrid solutions
MPP Logical Architecture
[Diagram: one “Control” node and four “Compute” nodes, each running SQL; every compute node has balanced storage, and a DMS instance is shown for each of the five nodes.]
1) User connects to the appliance (control node) and submits query
2) Control node query processor determines best *parallel* query plan
3) DMS distributes sub-queries to each compute node
4) Each compute node executes query on its subset of data
5) Each compute node returns a subset of the response to the control node
6) If necessary, control node does any final aggregation/computation
7) Control node returns results to user
Queries running in parallel on a subset of the data, using separate pipes, effectively making the pipe larger
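The seven steps above can be sketched as a scatter-gather computation. This is a toy Python illustration under stated assumptions: threads stand in for compute nodes, a modulo on a hypothetical `customer_id` column stands in for data distribution, and the control-node function plays the role of both the query processor and DMS.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Spread rows across compute nodes (a stand-in for balanced storage)."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[row["customer_id"] % node_count].append(row)
    return nodes


def compute_node_query(rows):
    """Steps 4-5: each node aggregates only its own subset of the data."""
    return sum(r["amount"] for r in rows), len(rows)


def control_node_query(nodes):
    """Steps 3, 6, 7: fan out sub-queries, then do the final aggregation."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(compute_node_query, nodes))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


sales = [{"customer_id": i, "amount": 10 * i} for i in range(1, 9)]
average = control_node_query(distribute(sales, node_count=4))
print(average)  # 45.0
```

An average cannot be computed by averaging the per-node averages, which is why each node returns a partial sum and count and the control node finishes the calculation.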
NoSQL databases
• Non-relational databases (semi-structured data)
• Types: Document, Key-value, Column, Graph
• MongoDB, Cassandra, HBase, DocumentDB, Riak
• Large-scale OLTP (e.g. popular web application)
• Scale-out solution
• High-availability
• JSON data
• Cons: data consistency, joining data, using SQL, quick mass updates, skillset
• Bad solution for a data warehouse, but can have a place in a big data solution
• Polyglot Persistence: use the right tool for the job
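A small Python sketch can show why key lookups are cheap in a document store while joins are listed as a con. The dictionary below only imitates a document database; the customer documents and function names are made up for illustration.

```python
import json

# Document model: one self-contained JSON-like document per customer, keyed by id.
documents = {
    "cust-1": {"name": "Ada", "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    "cust-2": {"name": "Lin", "orders": [{"sku": "A1", "qty": 5}]},
}


def get_customer(key):
    """Key lookup: the access pattern these stores are built for."""
    return documents[key]


def units_sold_by_sku(store):
    """A cross-document question has no JOIN here; the application scans and aggregates."""
    totals = {}
    for doc in store.values():
        for order in doc["orders"]:
            totals[order["sku"]] = totals.get(order["sku"], 0) + order["qty"]
    return totals


totals = units_sold_by_sku(documents)
print(json.dumps(get_customer("cust-1")))
print(totals)
```

In a relational warehouse the second question would be a single GROUP BY over an orders table, which is one reason the slide steers warehouse workloads away from NoSQL.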
Use Cases (theory)
Topics
Speed/Real-time
Batch/Traditional
Reporting Needs
Hybrid
Modern Data Warehouse
The Dream
All Sources
Enterprise Data Warehouse
The Reality
Let’s set off light bulbs in your head
Recommendation engines
Smart meter monitoring
Equipment monitoring
Advertising analysis
Life sciences research
Fraud detection
Healthcare outcomes
Weather forecasting for business planning
Oil & Gas exploration
Social network analysis
Churn analysis
Traffic flow optimization
IT infrastructure & Web App optimization
Legal discovery and document archiving
Data Analytics is needed everywhere
Intelligence Gathering
Location-based tracking & services
Pricing Analysis
Personalized Insurance
The Internet of Things – Manufacturing
GLOBAL OPERATIONS
I can see my production line status and recommend adjustments to better manage operational cost.
I know when to deploy the right resources for predictive maintenance to minimize equipment failures and reduce service cost.
I gain insight into usage patterns from multiple customers and track equipment deterioration, enabling me to reengineer products for better performance.
MANUFACTURING PLANT
Aggregate product data, customer sentiment, and other third-party syndicated data to identify and correct quality issues.
Manage equipment remotely, using temperature limits and other settings to conserve energy and reduce costs.
Monitor production flow in near-real time to eliminate waste and unnecessary work in process inventory.
GLOBAL FACILITY INSIGHT
Implement condition-based maintenance alerts to eliminate machine down-time and increase throughput.
THIRD-PARTY LOGISTICS
Provide cross-channel visibility into inventories to optimize supply and reduce shared costs in the value chain.
CUSTOMER SITE
Transmits operational information to the partner (e.g. OEM) and to field service engineers for remote process automation and optimization.
Management
R&D
Field Service
The Internet of Things – Oil & Gas
Utilize advanced 3D and 4D visualizations based on analytic algorithms to model subsurface geology
Production Manager
Onsite personnel
Establish near real-time communication and automatically publish events and alarms to the field to guide and protect onsite personnel and assets
Integrate all upstream data onto a unified platform to facilitate analytics, information sharing, and organizational transition
1. Exploration
2. Development
3. Drilling
4. Production
Geologist
Consolidate data from surveys, drill logs, and external sources to generate advanced reservoir models and production forecasts
Maximize recovery by monitoring near real-time production data and generating alerts for conditional maintenance needs
Combine near real-time drilling and seismic data to optimize drilling trajectories and recovery potential, while minimizing environmental risk
Operations Control Center
Find new hydrocarbon reservoirs quicker with seismic data uploaded to the cloud and prepared for analysis
NORTH SHORE PRODUCTION
PHARMACY
The Internet of Things – Pharma
Customer Service
Monitor device data to make more timely health decisions, such as adjusting dosages
Enable advanced product tracking and authentication to prevent counterfeits
Develop better products, faster, informed by a much larger data set based on patient outcomes
R&D
Anticipate medical device maintenance needs, and alert patients to schedule a doctor visit for replacement or repair
Healthcare Provider
Monitor medical device functionality for better customer service, reduced risk, and insight to improve product designs
Manage equipment remotely, using appropriate KPIs
Reduce machine downtime with condition-based maintenance alerts
Patient Home
Distribution
Manufacturing
Aggregate and correlate data from disparate medical devices with medications and health outcomes for advanced insight
Microsoft Azure services for IoT
• Producers: heterogeneous client agents, external data sources
• Event ingestion: Event Hubs (Service Bus)
• Storage: SQL Database, Table/Blob Storage, DocumentDB, external data sources
• Transformation: Machine Learning, HDInsight, Stream Analytics, Cloud Services
• Presentation & action: Azure Websites, Mobile Services, Notification Hubs, Power BI, external services
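As an illustration of what a streaming transformation stage does between ingestion and presentation, here is a minimal Python sketch of a tumbling-window average. It is plain Python, not the Stream Analytics query language, and the event fields (`ts`, `device`, `value`) are assumed for the example.

```python
from collections import defaultdict


def tumbling_window_average(events, window_seconds):
    """Group events into fixed, non-overlapping windows and average per device."""
    buckets = defaultdict(list)
    for event in events:
        window_start = (event["ts"] // window_seconds) * window_seconds
        buckets[(window_start, event["device"])].append(event["value"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


events = [
    {"ts": 0, "device": "pump-1", "value": 70.0},
    {"ts": 20, "device": "pump-1", "value": 74.0},
    {"ts": 65, "device": "pump-1", "value": 90.0},
]
averages = tumbling_window_average(events, window_seconds=60)
print(averages)
```

Each event lands in exactly one window, so the per-window results can be pushed to a dashboard or an alerting service as soon as the window closes.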
Hybrid
Use Cases (practice)
Topics
Manufacturing
Manufacturer of Automobiles
Manufacturer
One of the leading multinational automobile corporations and one of the largest companies in the world by revenue. They manufacture over 10 million vehicles a year.
Part 1: What They Did | Produces Internet of Things insights for their automobiles
Challenge
Needed to analyze the telemetry being emitted from their luxury car line in real-time.
Wanted to build a scalable, reliable, and highly available solution that has the ability to receive and process a large volume of vehicle information and maintenance events
Solution
Use Azure Blob, HDInsight, Storm in HDInsight, HBase in HDInsight, Event Hubs, DocumentDB, Machine Learning, and Power BI
Collect IoT data from automobiles:
• Telemetry data comes in real-time
• Able to process and generate insights around vehicle information and maintenance events
Internet of Things
Toyota
Manufacturer of Automobiles
Part 2: How They Did It | Produces Internet of Things insights for automobiles
How They Did It
Collect data from automobiles
• Send events in real-time to Event Hubs
• Stored into Azure Blobs
Retrieve reference data and do predictive analytics
• Get reference data stored in HBase
• Run ML algorithms on the telemetry to predict outcomes
Store into queryable store DocumentDB
• Stored in DocumentDB for Power BI to display as a dashboard
• Trigger Apache Storm in HDInsight to process and return results back to the vehicles
Internet of Things
[Diagram: cloud gateways feed Event Hubs (queuing service); data lands in Azure Blob (HDFS store), is joined with reference data from HBase, scored by Azure ML, and written to DocumentDB (NoSQL store) for a live Power BI dashboard. A second Event Hubs queue and Apache Storm on HDInsight carry results back. Steps: Get Data, Store in Blob, Get Reference Data, Do Machine Learning, Store in Queryable Store.]
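The flow of this case study (get data, store raw, get reference data, score, store in a queryable store) can be mimicked in a few lines of Python. Everything here is a stand-in: in-memory containers replace Azure Blob, HBase, and DocumentDB, the field names are invented, and `predict_maintenance` is a fixed threshold used as a placeholder for the ML step, not a trained model.

```python
raw_store = []  # stands in for Azure Blob
reference_data = {"model-x": {"max_engine_temp_c": 110}}  # stands in for HBase
queryable_store = {}  # stands in for DocumentDB


def predict_maintenance(event, reference):
    """Placeholder for the ML step: a fixed threshold, not a trained model."""
    return event["engine_temp_c"] > reference["max_engine_temp_c"]


def process(event):
    raw_store.append(event)                     # store the raw event
    reference = reference_data[event["model"]]  # get reference data
    needs_service = predict_maintenance(event, reference)
    # Store the enriched record where a dashboard could query it.
    queryable_store[event["vin"]] = {**event, "needs_service": needs_service}
    return needs_service


process({"vin": "VIN001", "model": "model-x", "engine_temp_c": 118})
process({"vin": "VIN002", "model": "model-x", "engine_temp_c": 95})
print(queryable_store)
```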
Power and Utilities & Oil and Gas
Industrial automation company partnering with multinational oil company
Oil and Gas
Leading industrial automation company that employs over 20,000 people, partnering with a leading multinational oil and gas company (one of the six oil and gas super majors) that employs over 90,000 people.
Part 1: What They Did | IoT internet-connected sensors to generate analytics for proactive maintenance
Challenge
Manage sites used for dispensing liquefied natural gas (clean fuel for commercial customers who do heavy-duty road transportation)
Built LNG refueling stations across US interstate highways
Stations are unmanned, so they built 24x7 remote management and monitoring to track diagnostics of each station for maintenance or tuning
Built internet-connected sensors embedded in 350 dispenser sites worldwide, generating tens of thousands of data points per second
• Temperature, pressure, vibration, etc.
Data needs outgrew company’s internal datacenter and data warehouse
Solution
Chose Azure HDInsight, Data Factory, SQL Database, Machine Learning
Dashboards used to detect anomalies for proactive maintenance
• Changes in performance of the components
• Energy consumption of components
• Component downtime and reliability
Future: Goal is to expand program to hundreds of thousands of dispensers
IoT, Analytics
Rockwell Automation
Industrial automation company partnering with multinational oil company
Part 2: How They Did It | IoT internet-connected sensors to generate analytics for proactive maintenance
How They Did It
Collect data from internet-connected sensors
• Tens of thousands of data points per second
• Interpolate time-series prior to analysis
• Stored raw sensor data in Blobs every 5 minutes
Use Hadoop to execute scripts and Data Factory to orchestrate
• Hive and Pig scripts orchestrated by Data Factory
• Data resulting from scripts loaded in SQL Database
• Queries detect site anomalies to indicate maintenance/tuning
Produced dashboards with role-based reporting
• Azure Machine Learning, SSRS, Power BI for O365
• Provide users with customizable interface
• View current and historical data (day-to-day operations, asset performance over time, etc.)
• Leveraged Azure Mobile Notification Hub for real-time notifications, alarms, or important events
Use Azure ML to predict
• Understand which pumps, run at what speeds, maximized water supply while minimizing energy use
IoT, Analytics
[Diagram: sensor data (temperature, pressure, vibration, etc.; tens of thousands of data points / second) is stored in Azure Blobs every 5 minutes, processed in Azure HDInsight with Hive and Pig, orchestrated by Data Factory, and loaded into Azure SQL DB. Azure Machine Learning and Power BI for O365 consume the results; Mobile Notification Hub sends real-time notifications to mobile devices.]
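"Interpolate time-series prior to analysis" usually means resampling irregular sensor readings onto a regular grid. The deck does not say which method was used; the Python sketch below assumes simple linear interpolation, and the `interpolate` function and sample values are illustrative only.

```python
def interpolate(samples, step):
    """Linearly resample (timestamp, value) pairs onto a regular grid.

    Requires at least two samples with distinct timestamps.
    """
    samples = sorted(samples)
    result = []
    t = samples[0][0]
    i = 0
    while t <= samples[-1][0]:
        # Advance to the pair of samples that brackets time t.
        while samples[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = samples[i], samples[i + 1]
        result.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += step
    return result


# A reading is missing at t=20; interpolation fills the gap.
pressure = [(0, 10.0), (10, 20.0), (30, 40.0)]
print(interpolate(pressure, step=10))
```

With evenly spaced values, readings from different sensors can be compared row by row before anomaly queries run.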
Government
Secretary of Finance and Public Credit - Government
Government
Government organization that handles finances, taxes, budget, income, and national debt for their country.
Part 1: What They Did | Fraud and Money Laundering Detection
Challenge
The government passed a law requiring all invoices to be submitted in electronic format
The tax department allows clients to upload their digital documents (pay stubs, expenditure slips) and now has 4 billion documents uploaded
Want to get insights into the data to do analysis, identify trends and fraud, and ensure compliance with tax obligations
Solution
Built electronic digital invoicing solution to upload invoices
• Pay stubs, expenditure slips
Use HDInsight to run queries and to process the electronic invoices to gain insights
Needed to scale to a peak of 150+ million invoices uploaded / day
Do fraud detection by understanding what people are doing to detect anomalies (e.g. tax fraud, money laundering, etc.)
Output of the system saved to SQL Server on-premises databases to run ad hoc queries
Fraud Detection
SAT
Secretary of Finance and Public Credit - Government
Part 2: How They Did It | Fraud and Money Laundering Detection
How They Did It
Store electronic digital invoices as XML documents in Azure Blobs
• Store approximately 4 billion invoices total
• Store 40 million – 180 million files every day
• Data is stored as XML files with metadata information
• Average size of each XML document is 5-10KB
Use Azure HDInsight (>140 node clusters)
• Do batch querying
• Use Hive, Pig, and MapReduce
• Hive external tables to make files queryable
• Run once per day
• Detect anomalies / fraud
Send to SQL Server in IaaS VM and then to SQL Server On-premises
• SQOOP data from Azure Blobs to SQL Server VMs
• ETL to SQL Server on-premises
• Do BI on top of SQL Server as a data mart
Fraud Detection
[Diagram: website to submit electronic documents (4 billion invoices total; at peaks, 150M invoices submitted/day) → invoices run through a parser and written out to Azure Blobs as XML files → HDInsight 140+ node cluster (Hive, Pig, MapReduce to detect anomalies/fraud; Hive external tables make files queryable) → SQOOP → SQL Server in IaaS VM → ETL → SQL Server on-premises → BI for insights.]
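Two things in this case study lend themselves to a quick sketch: the storage arithmetic and the parse-then-flag pattern. The Python below is illustrative only. The XML layout is invented (the real invoice schema is not shown in the deck), and flagging totals above ten times the median is an assumed stand-in for whatever anomaly rules the Hive and Pig jobs actually apply.

```python
import statistics
import xml.etree.ElementTree as ET

# Storage estimate from the slide's figures: 4 billion files at 5-10 KB each.
low_tb = 4_000_000_000 * 5 * 1000 / 1e12    # 20.0 TB
high_tb = 4_000_000_000 * 10 * 1000 / 1e12  # 40.0 TB

# Hypothetical invoice documents; element names are assumptions.
INVOICES = [
    '<invoice id="1"><issuer>A</issuer><total>100.00</total></invoice>',
    '<invoice id="2"><issuer>A</issuer><total>120.00</total></invoice>',
    '<invoice id="3"><issuer>B</issuer><total>90.00</total></invoice>',
    '<invoice id="4"><issuer>B</issuer><total>5000.00</total></invoice>',
]


def parse_invoice(xml_text):
    """Turn one XML document into a flat record that can be queried."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "issuer": root.findtext("issuer"),
        "total": float(root.findtext("total")),
    }


def flag_anomalies(invoices, factor=10):
    """Flag invoices whose total exceeds `factor` times the median total."""
    median = statistics.median(inv["total"] for inv in invoices)
    return [inv["id"] for inv in invoices if inv["total"] > factor * median]


parsed = [parse_invoice(x) for x in INVOICES]
flagged = flag_anomalies(parsed)
print(low_tb, high_tb, flagged)
```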
Entertainment and Gaming
Game Development Company
Gaming
A predominantly mobile-based game development company. While they are a mid-sized organization, they have partnered with media giants on various gaming projects
Part 1: What They Did | In-game Analytics
Challenge
As a game development studio, they wanted to do in-game analytics to understand their players more and what they do in the games
Solution
Chose Azure HDInsight (MapReduce and Storm), Service Bus and also use SQL Server for reporting
Switched from Amazon AWS EMR
Collects telemetry and logging data to gain in-game analytics:
• How many players using the game
• How many players invited their friends
• How far along did players get into the tutorial
• How many attempts did they make on one level/stage
In-game Analytics
Media tonic
Game Development Company
Part 2: How They Did It | In-game Analytics
How They Did It
Collect data from games in Azure Blobs
• Game sends telemetry/logging data as JSON files
• Contains every action of user in the game
• Data is pushed to Azure Service Bus in real time
• Tens of Gigabytes of data captured daily
HDInsight picks up real-time data and processes
• From Service Bus, HDInsight processes using Apache Storm and MapReduce
• Constantly running experiments to determine insight
• A/B testing
• In-game metrics and analytics
• Spin up 32-node cluster nightly for four hours
Output sent to SQL Server for BI
• Transfer data to SQL Server for BI
In-game Analytics
[Diagram: real-time events flow through Service Bus into Azure Blobs and Azure HDInsight; output goes to SQL Server on-premises, which feeds BI for insights.]
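The four questions listed for this case study (active players, invites, tutorial progress, attempts per level) reduce to simple aggregations over JSON events. The Python sketch below assumes a hypothetical event shape with `player`, `action`, `step`, and `level` fields; the real telemetry format is not given in the deck.

```python
import json
from collections import Counter

# Hypothetical telemetry lines, one JSON object per game action.
EVENTS = [
    '{"player": "p1", "action": "tutorial_step", "step": 3}',
    '{"player": "p1", "action": "level_attempt", "level": 1}',
    '{"player": "p1", "action": "level_attempt", "level": 1}',
    '{"player": "p2", "action": "tutorial_step", "step": 1}',
    '{"player": "p2", "action": "invite_friend"}',
]


def summarize(lines):
    events = [json.loads(line) for line in lines]
    players = {e["player"] for e in events}
    inviters = {e["player"] for e in events if e["action"] == "invite_friend"}
    furthest_step = {}
    attempts = Counter()
    for e in events:
        if e["action"] == "tutorial_step":
            furthest_step[e["player"]] = max(furthest_step.get(e["player"], 0), e["step"])
        elif e["action"] == "level_attempt":
            attempts[(e["player"], e["level"])] += 1
    return {
        "players": len(players),
        "players_who_invited": len(inviters),
        "furthest_tutorial_step": furthest_step,
        "attempts_per_level": dict(attempts),
    }


summary = summarize(EVENTS)
print(summary)
```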
Non-Profit
JustGiving, Non-Profit
Non-profit
JustGiving, a global online social platform for giving. It's a financial service (not a charity) that lets you "raise money for a cause you care about" through your network of friends. Their goal is to become "Facebook of Giving"
Part 1: What They Did | Recommendation Engine
Challenge
They wanted to identify what was personal and relevant to people and what they cared about, so that they could suggest further causes that may inspire continual involvement.
With 22 million customers this meant storing and processing huge amounts of data that their existing infrastructure simply couldn’t support.
Solution
Chose SQL Server on-premises, Azure HDInsight, Blobs, Tables, Cache, and Service Bus
Deployed a network of “social giving” for people to make it a group activity to support a cause
• Built a way to inform givers of a charity goal based on a person’s position in their social graph
• Help identify causes that a user might be interested in (based on demographics and their social graph)
• Recommend people to add to their social graph as well as other charitable causes
Recommendation
Just Giving
JustGiving, Non-Profit
Part 2: How They Did It | Recommendation engine
How They Did It
Collect data in Azure Blobs
• Move data from SQL Server through an Agent to Azure Blobs
HDInsight processes data for insights
• Input data is 20-30GB / job
• Use MapReduce jobs to create a graph
• Further job to denormalize activity feeds for all users
• Generates an activity recommendation
Generates a real-time recommendation
• Real-time activity feeds/events coming in from Service Bus (~50 events/second)
• Activity recommendation coming out of daily HDInsight job
• Sent to web-site
Recommendation
[Diagram: SQL Server on-premises → Agent → Azure Blobs → Azure HDInsight, which produces Activity Feeds and the Give Graph in Azure Tables. Real-time events arrive through Service Bus into an event store. A Web API serves results to the website, with Azure Cache alongside.]
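A social-graph recommendation of the kind described can be sketched in a few lines of Python. This is an in-memory toy, not the MapReduce implementation: the `friends` and `supports` dictionaries, the user names, and the cause names are all invented, and the scoring rule (count how many friends support a cause the user does not yet support) is an assumption.

```python
from collections import Counter

# Hypothetical social graph and giving history.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann"},
    "cat": {"ann", "dan"},
    "dan": {"cat"},
}
supports = {
    "ann": {"literacy"},
    "bob": {"literacy", "wildlife"},
    "cat": {"wildlife", "clean-water"},
    "dan": {"shelter"},
}


def recommend_causes(user, limit=3):
    """Rank causes backed by the user's friends that the user has not backed yet."""
    already = supports.get(user, set())
    scores = Counter()
    for friend in friends.get(user, ()):
        for cause in supports.get(friend, ()):
            if cause not in already:
                scores[cause] += 1
    # Sort by score (descending), then name, so ties resolve deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cause for cause, _ in ranked[:limit]]


print(recommend_causes("ann"))  # ['wildlife', 'clean-water']
```

At production scale the same counting would run as batch jobs over the full graph, with the results cached for the website to serve.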
Resources
• The Modern Data Warehouse: http://bit.ly/1xuX4Py
• Should you move your data to the cloud? http://bit.ly/1xuXbKU
• Presentation slides for Modern Data Warehousing: http://bit.ly/1xuXcP5
• Presentation slides for Building an Effective Data Warehouse Architecture: http://bit.ly/1xuXeX4
• Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
Q & A ?
James Serra, Big Data Evangelist
Email me at: [email protected]
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck will be posted)