Post on 16-Dec-2014
What is Big Data and why should I care?
James Serra, PDW Technology Solution Professional
08/26/14
About Me
- Business Intelligence Consultant, in IT for 28 years
- Microsoft, PDW Technology Solution Professional (TSP)
- Owner of Serra Consulting Services, specializing in end-to-end Business Intelligence and Data Warehouse solutions using the Microsoft BI stack
- Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer
- Been perm, contractor, consultant, business owner
- Presenter at PASS Business Analytics Conference and PASS Summit
- MCSE for SQL Server 2012: Data Platform and BI
- SME for SQL Server 2012 certs
- Contributing writer for SQL Server Pro magazine
- Blog at JamesSerra.com
- SQL Server MVP
- Author of book “Reporting with Microsoft SQL Server 2012”
How can big data help me?
Being able to extract data from various sources inside and outside the enterprise, and then transform it all into key business insights, can provide a significant competitive advantage by enabling better business decisions.
- It all comes down to: the more data you have, the better business decisions you can make
- First step is to understand the importance of a data warehouse
- You need to understand what big data is
- You need to make sure your data warehouse can handle big data (do you have a big data problem?)
- You need examples of how big data can help you
- You need to understand Hadoop and its use cases with a data warehouse
- You need to understand the difference between scaling up (SMP) and scaling out (MPP)
- Understand the limitations of a traditional data warehouse and then build a modern data warehouse
Why use a Data Warehouse?
Legacy applications + databases = chaos
[Diagram: separate systems for Production Control, MRP, Inventory Control, Parts Management, Logistics, Shipping, Raw Goods, Order Control, Purchasing, Marketing, Finance, Sales, Accounting, Management Reporting, Engineering, Actuarial, Human Resources]
Enterprise data warehouse = order
- Single version of the truth
- Continuity, Consolidation, Control, Compliance, Collaboration
- Every question = decision
Two purposes of a data warehouse: 1) save time building reports; 2) slice and dice in ways you could not before
What is a Data Warehouse and why use one?
All these reasons are for data warehouses only (not OLTP):
- Reduce stress on production system
- Optimized for read access, sequential disk scans
- Integrate many sources of data
- Keep historical records (no need to save hardcopy reports)
- Restructure/rename tables and fields, model data
- Protect against source system upgrades
- Use Master Data Management, including hierarchies
- No IT involvement needed for users to create reports
- Improve data quality and plug holes in source systems
- One version of the truth
- Easy to create BI solutions on top of it (e.g. SSAS cubes)
The traditional data warehouse
… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.
– Gartner, “The State of Data Warehousing in 2012”
[Diagram: Data sources (OLTP, ERP, CRM, LOB) -> ETL -> Data warehouse -> BI and analytics]
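The flow from data sources through ETL into a warehouse that BI queries can be sketched in a few lines. This is only an illustration, not how any particular product does it: the source rows, field names, and table names (dim_customer, fact_sales) are made up, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical extracts from two source systems (all names and values are made up).
crm_rows = [{"customer": "Acme ", "region": "west"},
            {"customer": "Globex", "region": "EAST"}]
erp_rows = [{"cust_name": "Acme", "amount": "120.50"},
            {"cust_name": "Globex", "amount": "80"},
            {"cust_name": "Acme", "amount": "30"}]

def transform_customers(rows):
    """Trim names and normalize region casing so sources agree."""
    return [(r["customer"].strip(), r["region"].strip().lower()) for r in rows]

def transform_sales(rows):
    """Rename fields and convert amounts from text to numbers."""
    return [(r["cust_name"].strip(), float(r["amount"])) for r in rows]

def load(conn, customers, sales):
    """Load the cleaned rows into a small dimension table and fact table."""
    conn.execute("CREATE TABLE dim_customer (name TEXT PRIMARY KEY, region TEXT)")
    conn.execute("CREATE TABLE fact_sales (customer TEXT REFERENCES dim_customer(name), amount REAL)")
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", sales)
    conn.commit()

def sales_by_region(conn):
    """A BI-style query: total sales per region."""
    return conn.execute(
        "SELECT c.region, SUM(s.amount) FROM fact_sales s "
        "JOIN dim_customer c ON c.name = s.customer "
        "GROUP BY c.region ORDER BY c.region"
    ).fetchall()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform_customers(crm_rows), transform_sales(erp_rows))
report = sales_by_region(warehouse)
print(report)
```

The point of the sketch is that reports run against the integrated, cleaned copy rather than against each source system.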
Will your current solution handle future needs?
An illustration of the velocity of data created
Kalakota, R. (2012, October 22). Sizing “Mobile + Social” Big Data Stats. Retrieved from http://practicalanalytics.wordpress.com/
Social and web analytics
Live data feeds
Advanced analytics
What is big data and why is it valuable to the business?
An evolution in the nature and use of data in the enterprise
The three V’s
[Chart: volume (megabytes to petabytes) plotted against data complexity (variety and velocity)]
What is a data scientist?
Excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge
- Evolution from data analyst role
- Strong business acumen
- Part analyst, part artist
- Good with data modeling, machine learning, data mining
- Azure ML, SAS and R
Big data defined: Is it the size of the data?
- Volume
- Quantity
- Big data does not just mean the size of the data
Big data defined: Is it the frequency of the data?
- Velocity
- The rate at which the data changes
Big data defined: Is it the type of data?
- Variety
- Different types of data such as audio, video, text
- Structured (from a relational database)
- Unstructured (videos, pictures, PDF documents, email)
- Semi-structured (Twitter feed, Facebook, XML, Excel)
- Variability: The different meanings/contexts associated with a given piece of data
Why do I need data in a relational format?
• Creation of metadata
• To join multiple tables/files via a column
• Referential integrity
• Constraints
• Default values
• Optimizations, Indexes
• Transactions
• Use of SQL
• User authentication and access (security)
• Updating and maintenance, consistency, reliability
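Several of these points (joins via a column, referential integrity, constraints, default values, indexes, transactions) can be seen in a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a relational engine; the customer and orders tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active'            -- default value
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),  -- referential integrity
    amount      REAL NOT NULL CHECK (amount > 0)            -- constraint
);
CREATE INDEX idx_orders_customer ON orders(customer_id);    -- optimization
""")

# "with conn" wraps the statements in a transaction: commit on success, roll back on error.
with conn:
    conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme')")
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 25.0)")

def try_insert(sql):
    """Run one statement in its own transaction and report whether the engine accepted it."""
    try:
        with conn:
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

orphan = try_insert("INSERT INTO orders (customer_id, amount) VALUES (99, 10.0)")  # no such customer
negative = try_insert("INSERT INTO orders (customer_id, amount) VALUES (1, -5.0)")  # fails the CHECK

# Join the two tables via the customer_id column.
joined = conn.execute(
    "SELECT c.name, c.status, o.amount FROM orders o "
    "JOIN customer c ON c.id = o.customer_id"
).fetchall()
print(orphan, negative, joined)
```

Files sitting in a file system give none of these guarantees by themselves, which is why data usually has to be converted to a relational format before it is trusted for reporting.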
Big data defined: Is it the performance of the data?
- Are you using a dashboard (slice and dice) or an operational reporting tool?
- What is the Service Level Agreement (SLA)?
Questions to see if you have a big data problem
Qualification Questions
1. Is your data volume growth becoming unmanageable using currently implemented DW technologies? (>20-30% annually)
2. Is there a specific Big Data business need (e.g. social media analysis, fraud detection) in a high-priority industry (Retail, Financial, Pub Sec)?
3. Is your DW or storage spend consuming a disproportionate and increasing amount of your IT budget?
4. Do your business users need to find, combine, and refine structured and unstructured data? Internal and external sources?
5. In the near future do you expect to need both on-premise and cloud-based BI capabilities?
6. Do you have a need to capture and analyze streaming data? At what scale and velocity?
7. Do you currently (or plan to) collect, store, and analyze multiple forms of unstructured data (XML, JSON, CSV, etc.)?
8. Are you able to serve your business users’ analytics provisioning and data requests in a timely manner?
9. Are you experiencing data management issues such as security or compliance due to business owners (“shadow” IT) creating their own unmanaged data stores?
10. Are you trying to build, grow, and manage your next-generation DW without adding new headcount or talent (data scientists, external consultants, etc.)?
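To get a feel for what 20-30% annual growth means for capacity planning, the compound-growth arithmetic can be sketched as below. The starting size of 10 TB and the function names are assumptions chosen for illustration only.

```python
import math

def projected_size_tb(current_tb, annual_growth, years):
    """Size after compounding growth: current * (1 + g) ** years."""
    return current_tb * (1 + annual_growth) ** years

def years_to_double(annual_growth):
    """Solve (1 + g) ** n = 2 for n."""
    return math.log(2) / math.log(1 + annual_growth)

# Hypothetical 10 TB warehouse at the low and high ends of the range.
for growth in (0.20, 0.30):
    size_in_5_years = projected_size_tb(10, growth, 5)
    print(f"{growth:.0%} growth: {size_in_5_years:.1f} TB after 5 years, "
          f"doubles in about {years_to_double(growth):.1f} years")
```

At these rates a warehouse doubles roughly every two and a half to four years, so hardware sized for today's volume is outgrown quickly.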
Examples of when big data has become a problem
- When queries are slow
- When you run out of disk space
- When your data warehouse can’t import certain types of data
- When your maintenance window gets overrun
- When you are not able to give the users data more frequently
- When you can’t integrate with cloud data
Using “Big” data to complete the picture
1. Social media: customer sentiment
2. Bike sensors: complete journey
3. Bus GPS: react to traffic
4. Wi-Fi: customer movement in stations
What is Hadoop?
Distributed, scalable system on commodity HW
Composed of a few parts:
HDFS – Distributed file system
MapReduce – Programming model
Other tools: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper
Main players are Hortonworks and Cloudera
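The MapReduce programming model can be illustrated with the classic word count. The sketch below runs in a single Python process and only mimics the three phases (map, shuffle, reduce); on a real Hadoop cluster the framework runs the map and reduce functions on many nodes against data in HDFS. The function names are illustrative, not a Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data is big", "data warehouse"])
print(counts)
```

Because each map call looks at one record and each reduce call looks at one key, the work can be split across as many machines as there are records and keys.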
[Diagram: Hadoop services, grouped as Core Services, Operational Services, Data Services, and Load & Extract. Components shown: HDFS, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon, Sqoop, Flume, NFS, WebHDFS, Oozie, Ambari]
[Diagram: Hadoop cluster made up of many compute & storage nodes]
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
The “Expanded” Hadoop Ecosystem
Hadoop benefits
• Provides storage for big data at a reasonable cost, since it is built around commodity hardware
• Provides a robust environment, as it was designed to be fault-tolerant and to deliver high throughput for extremely large datasets
• Allows for the capture of new or more data such as unstructured, semi-structured, and structured in batch or real-time
• Data can be stored longer, so you no longer have to purge older data
• Provides scalable analytics via distributed storage and distributed processing
• Provides rich analytics via support for languages and tools such as Java, Mahout, Ruby, Python, and R
Reasons not to use Hadoop as your DW
• Hadoop is slow for reading queries. HDP 2.0 today will not perform anywhere near PDW for interactive querying. This is why PolyBase is so important, as it bridges the gap between the two technologies so customers can take advantage of both the unique features of Hadoop and the benefits of an EDW. Truth be told, users won’t want to wait 20+ seconds for a MapReduce job to start up to execute a Hive query
• Hadoop is not relational, as all the data is in files in HDFS, so there always is a conversion process to convert the data to a relational format
• Hadoop is not a database management system. It does not have functionality such as update of data, referential integrity, statistics, ACID compliance, data security, and the plethora of tools and facilities needed to govern corporate data assets
• There is no metadata stored in HDFS, so another tool needs to be used to store that, adding complexity and slowing performance
• Finding expertise in Hadoop is very difficult: only a small number of people understand Hadoop and all its various versions and products, versus the large number of people who know SQL
• Super complex, lots of integration with multiple technologies to make it work
• Many tools/technologies/versions/vendors, no standards
• Some reporting tools don’t work against Hadoop
What is a data lake?
Large object-based storage repository that holds data in its native format until it is needed.
• A place to store unlimited amounts of data in any format inexpensively
• Usually Hadoop
• A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried
• Also called bit bucket or landing zone
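The idea that the schema is not defined until the data is queried (often called schema-on-read) can be sketched as follows. The raw records, field names, and the read_with_schema helper are all hypothetical; a real data lake would hold files in HDFS or object storage rather than a Python list.

```python
import json

# Raw records landed in the lake exactly as they arrived, with no common schema.
raw_lines = [
    '{"device": "bike-17", "ts": "2014-08-26T09:00:00", "speed_kmh": 18.5}',
    '{"user": "@rider", "text": "Great ride today"}',
    '{"device": "bus-4", "ts": "2014-08-26T09:01:00", "lat": 51.5, "lon": -0.12}',
]

def read_with_schema(lines, required_fields):
    """Apply a schema at query time: keep records that have the fields, project only those fields."""
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in required_fields):
            yield {field: record[field] for field in required_fields}

# Two different "queries" impose two different schemas on the same stored data.
device_events = list(read_with_schema(raw_lines, ["device", "ts"]))
tweets = list(read_with_schema(raw_lines, ["user", "text"]))
print(device_events)
print(tweets)
```

Nothing had to be modeled before the data was stored; the cost is that every reader must decide how to interpret it.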
Query Hadoop data with T-SQL using PolyBase
Bringing the worlds of big data and the data warehouse together for users and IT
Provides a single T-SQL query model (“semantic layer”) for APS and Hadoop with rich features of T-SQL, including joins without ETL
[Diagram: a T-SQL Select sent to SQL Server Parallel Data Warehouse goes through PolyBase to query relational + non-relational data and returns a result set]
Hadoop sources shown:
- Microsoft HDInsight, HDP 1.3 (2.0 in AU2)
- Hortonworks HDP 2.1 (Windows, Linux)
- Cloudera CDH Linux 4.6
- Windows Azure HDInsight 2.4 (HDFS)
- AU1: Windows Azure storage blob (WASB)
Others (SQL Server, DB2, Oracle)? True federated query engine
Use cases where PolyBase simplifies using Hadoop data
- Bringing islands of Hadoop data together
- High performance queries against Hadoop data
- Archiving data warehouse data to Hadoop (move)
- Exporting relational data to Hadoop (copy)
- Importing Hadoop data into data warehouse (copy)
Big Data Landscape
Big Data Landscape (Version 2.0)
What is the Internet of Things (IoT)?
Internet-connected devices that can perceive the environment in some way, share their data, and communicate with you
- Has a processor and a sensor to collect information
- Examples: heart monitoring implants, biochip transponders on farm animals, automobiles with built-in sensors, field operation devices that assist firefighters in search and rescue
- Excludes computers, tablets, and smart phones
Cool possibilities
- When a milk carton is almost empty it will ping you when you are near a store
- An alarm clock that signals your coffee maker to start brewing when you wake up
- An embedded chip that monitors your vital signs and notifies a medical provider if they exceed a limit
What is SMP and MPP?
PDW is a data warehouse and MPP (massively parallel processing) solution, not an OLTP (online transaction processing) and SMP (symmetric multiprocessing) solution. SMP is one server where each CPU in the server shares the same memory, disk, and network controllers (scale-up). MPP means data is distributed among many independent servers running in parallel and is a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk (scale-out).
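The shared-nothing idea can be sketched in a few lines: rows are hash-distributed across nodes, each node aggregates only its own rows, and the partial results are combined. This is a toy model under stated assumptions, not how PDW is implemented: Python lists stand in for each server's private storage, threads stand in for independent servers, and the function names are made up.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def node_for(key, node_count):
    """Pick a node by hashing the distribution key (crc32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % node_count

def distribute(rows, node_count):
    """Place each row on exactly one node; nodes share nothing."""
    nodes = [[] for _ in range(node_count)]
    for customer, amount in rows:
        nodes[node_for(customer, node_count)].append((customer, amount))
    return nodes

def local_totals(node_rows):
    """Work done independently on one node, using only its own rows."""
    totals = {}
    for customer, amount in node_rows:
        totals[customer] = totals.get(customer, 0) + amount
    return totals

def mpp_totals(rows, node_count=4):
    nodes = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_totals, nodes))
    merged = {}
    for partial in partials:
        merged.update(partial)  # keys never overlap: each customer hashes to one node
    return merged

sales = [("acme", 100), ("globex", 40), ("acme", 25), ("initech", 10)]
totals = mpp_totals(sales)
print(totals)
```

Adding nodes spreads both the data and the work (scale-out), whereas an SMP server can only be given more CPU, memory, or disk within one box (scale-up).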
When do you need an MPP solution?
- We need at least 3x performance improvement
- We are near disk capacity and see a lot of growth in the upcoming years
- We need to support queries during our maintenance window
- We need to load data outside of our maintenance window
- We want to make non-relational data part of our data warehouse
- We will spend a lot of money on FusionIO cards, SSDs, more SAN space, more memory, faster CPUs
How to “break” the traditional data warehouse
[Diagram: Data sources (OLTP, ERP, CRM, LOB) -> ETL -> Data warehouse -> BI and analytics]
1. Increasing data volumes
2. Real-time performance/data
3. New data sources & types (non-relational data: devices, web, sensors, social)
4. Cloud-born data
Modern data warehouse defined
[Diagram: layered architecture]
- Data sources: OLTP, ERP, CRM, LOB; non-relational data from devices, web, sensors, social
- Infrastructure
- Data management & processing: non-relational, relational, analytical, streaming, internal & external
- Data enrichment and federated query: extract, transform, load; single query model; data quality; master data management
- BI & analytics: self-service, collaboration, corporate, predictive, mobile
[Diagram: data flows through the stages Source Data -> Staging Hadoop -> Staging RDBMS -> Data Warehouse -> OLAP -> User Presentation]
Source data (big data):
- Unstructured data (Word docs, blobs, logs)
- Semi-structured data (XML, JSON)
- Structured data (.TXT, CSV, delimited)
- Other (social media, sensors, devices)
Components shown: Hadoop ecosystem, APS/HDI, staging DB, ODS, EDW, APS/PDW, SQL Server Analysis Services, PolyBase
The Microsoft Modern Data Warehouse
Introducing the Microsoft Analytics Platform System
Your turnkey modern data warehouse appliance
Enterprise-ready big data
• Relational and non-relational data in a single appliance
• Or, integrate relational data with non-relational data in an external Hadoop cluster on premise or data stored in the Cloud (hot, warm, cold)
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and APS using T-SQL (PolyBase)
• Direct integration with Microsoft BI tools such as Power BI
Next-generation performance at scale
• Near real-time performance with In-Memory
• Scale-out to accommodate your growing data or to increase performance (2-nodes to 56-nodes)
• Remove SMP DW bottlenecks with MPP SQL Server
• No rip and replace when more performance needed
• No performance tuning required
• Concurrency that fuels rapid adoption
Engineered for optimal value
• Industry’s lowest DW price/TB
• Value through a single appliance solution
• Value with flexible hardware options using commodity hardware
• Free up space on SAN (cost averages 10k per TB)
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Questions?
James Serra, jserra@microsoft.com
Blog about PDW topics: http://www.jamesserra.com/archive/category/pdw/