Philly Code Camp 2013 Mark Kromer Big Data with SQL Server

Date post: 10-Nov-2014
Upload: mark-kromer
Description:
These are my slides from May 2013 Philly Code Camp at Penn State Abington. I will post the samples, code and scripts on my blog here following the event this Saturday: http://www.kromerbigdata.com
Big Data with SQL Server Philly Code Camp 2013.1 May 2013 http://www.pssug.org Mark Kromer http://www.kromerbigdata.com @kromerbigdata @mssqldude
Transcript
Page 1: Philly Code Camp 2013 Mark Kromer Big Data with SQL Server

Big Data with SQL Server

Philly Code Camp 2013.1
May 2013
http://www.pssug.org

Mark Kromer
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude

Page 2: What we’ll (try) to cover today

‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ Microsoft + Hortonworks (Yahoo!) = HDInsight

Page 3: Big Data 101

‣ 3 V’s
  ‣ Volume – Terabyte records, transactions, tables, files
  ‣ Velocity – Batch, near-time, real-time (analytics), streams
  ‣ Variety – Structured, unstructured, semi-structured, and all the above in a mix
‣ Text Processing
  ‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
‣ Analytics & Insights
‣ Distributed File System & Programming

Page 4

‣ Batch Processing
‣ Commodity Hardware
‣ Data Locality, no shared storage
‣ Scales linearly
‣ Great for large text file processing, not so great on small files
‣ Distributed programming paradigm

Page 5: Popular Hadoop Distributions

Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure

Page 6: Mark’s Big Data Myths

‣ Big Data ≠ NoSQL
  ‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.), but it is not the same thing
  ‣ Facebook, for example, uses HBase from the Hadoop stack
‣ Big Data ≠ Real Time
  ‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value
  ‣ Use in-memory analytics for real-time insights
‣ Big Data ≠ Data Warehouse
  ‣ I still refer to large multi-TB DWs as “VLDB”
  ‣ Big Data is about crunching stats in text files for discovery of new patterns and insights
  ‣ Use the DW to aggregate and store the summaries of those calculations for reporting

Page 7: Big Data Analytics Web Platform - Example

[Architecture diagram. Stages: Data Sources; Data Mastering; Data Warehouse & Analytics; Presentation. Components: External & Extended Data; Data Mapping; Business Rules; MapReduce Jobs; Big Data Sandboxes; Media Level Data Warehouse; Audience Level Data Warehouse; Attribution; Segmentation; Stacking Effect; Tableau, SAS & Excel]

Page 8: MapReduce Framework (Map)

using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // expected, pagePos and hit are not defined on the slide;
    // they must be declared elsewhere in this class.
    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);

        // Split the log line on runs of whitespace.
        var parts = Regex.Split(inputLine, "\\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }

        // Emit one (page, hit) pair per complete record.
        context.EmitKeyValue(parts[pagePos], hit);
    }
}

Page 9: MapReduce Framework (Reduce & Job)

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Used as both combiner and reducer: sums the hit values emitted for each page.
public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

// Ties the mapper and reducer together; input and output paths come from environment variables.
public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true; // remove any previous output before the run
        return retVal;
    }
}
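
To see what the mapper and reducer on these two slides do without a Hadoop cluster, the flow can be imitated in a few lines of Python. This is only a local sketch of the map, shuffle, and reduce steps, not the .NET SDK. The log layout, `EXPECTED_FIELDS`, `PAGE_POS`, and the sample lines are assumptions made for illustration, since the slides do not show those values.

```python
from collections import defaultdict

# Assumed log layout, for illustration only: "<date> <time> <page> <status>".
EXPECTED_FIELDS = 4
PAGE_POS = 2


def map_line(line):
    """Like the mapper: emit (page, "1") for each complete record."""
    parts = line.split()
    if len(parts) != EXPECTED_FIELDS:
        return []  # skip records that do not have all values
    return [(parts[PAGE_POS], "1")]


def shuffle(pairs):
    """Group values by key, which Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_key(key, values):
    """Like the reducer: sum the string-encoded counts for one page."""
    return key, str(sum(int(v) for v in values))


def total_hits(lines):
    """Run map, shuffle, and reduce over an in-memory list of log lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_key(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        "2013-05-07 12:11:00 /index.html 200",
        "2013-05-07 12:11:01 /about.html 200",
        "2013-05-07 12:11:02 /index.html 304",
        "malformed line",
    ]
    print(total_hits(sample))
```

On a cluster the grouping step is done by the framework across machines; here it is a single dictionary, which is why the sketch says nothing about performance.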

Page 10: Get Data into Hadoop

‣ Linux-style shell commands to access data in HDFS
‣ Put file in HDFS:
hadoop fs -put sales.csv /import/sales.csv
‣ List files in HDFS:
c:\Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
‣ View file in HDFS:
c:\Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.

Page 11: Use Hive for Data Schema and Analysis

-- External table: Hive reads the comma-delimited text files in place.
create external table ext_sales
(
  lastname string,
  productid int,
  quantity int,
  sales_amount float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/makromer/hiveext/input';

-- Move the uploaded file into the table location, replacing any existing data.
LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
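
The table definition above only describes how to interpret the text file; the file itself stays plain comma-separated text. As a rough local illustration, the Python sketch below applies the same four columns to the sample rows from the HDFS slide and then totals `sales_amount` per `productid`. The grouping query it imitates is a hypothetical example, not one shown in the deck, and nothing here talks to Hive.

```python
import csv
import io
from collections import defaultdict

# Column names and types copied from the ext_sales definition.
SCHEMA = [
    ("lastname", str),
    ("productid", int),
    ("quantity", int),
    ("sales_amount", float),
]

# Sample rows as shown by `hadoop fs -cat /import/sales.csv`.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""


def read_rows(text):
    """Apply the schema while reading; the stored text stays untyped."""
    for fields in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def sales_by_product(rows):
    """Hypothetical analysis: total sales_amount grouped by productid."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["productid"]] += row["sales_amount"]
    return dict(totals)


if __name__ == "__main__":
    totals = sales_by_product(read_rows(SALES_CSV))
    for productid in sorted(totals):
        print(productid, totals[productid])
```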

Page 12: Sqoop: Data transfer to & from Hadoop & SQL Server

‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith
‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
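
Slide software tends to turn the double hyphens in these commands into dashes, so it can help to build the argument list in code instead of copying it from a slide. The helper below is a hypothetical sketch: `sqoop_args` is not part of Sqoop, it only assembles the same flags used above (`--connect`, `--username`, `--password`, `--table`, `-m`, `--export-dir`) and does not run anything.

```python
import shlex


def sqoop_args(tool, connect, username, password, table, mappers=1, export_dir=None):
    """Build the argument list for a `sqoop import` or `sqoop export` call."""
    if tool not in ("import", "export"):
        raise ValueError("tool must be 'import' or 'export'")
    args = [
        "sqoop", tool,
        "--connect", connect,
        "--username", username,
        "--password", password,
        "--table", table,
        "-m", str(mappers),
    ]
    if tool == "export":
        if export_dir is None:
            raise ValueError("export needs an HDFS export_dir")
        args += ["--export-dir", export_dir]
    return args


if __name__ == "__main__":
    cmd = sqoop_args("import", "jdbc:sqlserver://localhost", "sqoop", "password", "customers")
    # Print the command line; pass `cmd` to subprocess.run to execute it for real.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems, and the password is shown inline only because the slide does it that way.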

Page 13: SQL Server Big Data – Data Loading

[Diagram. Labels: Amazon S3 Bucket; Data Loading; Amazon HDFS & EMR]

Page 14: Role of NoSQL in a Big Data Analytics Solution

‣ Use NoSQL to store data quickly without the overhead of RDBMS
  ‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
‣ Why NoSQL?
  ‣ In the world of “Big Data”
    ‣ “Schema later”
    ‣ Ignore ACID properties
    ‣ Drop data into key-value store quick & dirty
    ‣ Worry about query & read later
‣ Why NOT NoSQL?
  ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface
‣ SQL Server and NoSQL
  ‣ Not a natural fit
  ‣ Use HDFS or your favorite NoSQL database
  ‣ Consider turning off SQL Server locking mechanisms
  ‣ Focus on writes, not reads (read uncommitted)
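
The "schema later" idea can be shown with a toy key-value store: writes accept any payload untouched, and structure is imposed only when someone reads. The sketch below is an assumption-laden illustration. `KeyValueStore` is an in-memory stand-in rather than a client for any of the products listed, and the field names are made up.

```python
import json


class KeyValueStore:
    """In-memory stand-in for a key-value store; not a real NoSQL client."""

    def __init__(self):
        self._data = {}

    def put(self, key, raw):
        # Write path: store the payload as-is, with no parsing or validation.
        self._data[key] = raw

    def scan(self):
        return self._data.items()


def read_with_schema(store, fields):
    """Read path: parse each payload and keep only rows that fit the schema."""
    rows = []
    for _key, raw in store.scan():
        try:
            doc = json.loads(raw)
        except ValueError:
            continue  # unparseable payloads are only discovered at read time
        if not isinstance(doc, dict):
            continue
        if all(name in doc for name in fields):
            rows.append({name: cast(doc[name]) for name, cast in fields.items()})
    return rows


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("a", '{"page": "/home", "hits": "3", "extra": true}')
    store.put("b", "not json at all")
    store.put("c", '{"page": "/about"}')
    print(read_with_schema(store, {"page": str, "hits": int}))
```

The trade-off matches the bullets: writes are cheap because nothing is checked, and the cost of bad or incomplete records moves to whoever queries the data later.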

Page 15: SQL Server Big Data Environment

‣ SQL Server Database
  ‣ SQL 2012 Enterprise Edition
  ‣ Page Compression
  ‣ 2012 Columnar Compression on Fact Tables
  ‣ Clustered Index on all tables
  ‣ Auto-update Stats Async
  ‣ Partition Fact Tables by month and archive data with sliding window technique
  ‣ Drop all indexes before nightly ETL load jobs
  ‣ Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
  ‣ SSAS 2012 Enterprise Edition
  ‣ 2008 R2 OLAP cubes partition-aligned with DW
  ‣ 2012 in-memory tabular cubes
  ‣ All access through MSMDPUMP or SharePoint
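
The sliding window technique comes down to a monthly decision: which partitions stay in the fact table, which ones get archived, and which new one is opened for incoming data. The Python sketch below only computes that decision. The retention length and the `YYYYMM` partition keys are assumptions for illustration; in SQL Server the actual work would be done with partition switch, merge, and split operations.

```python
from datetime import date


def add_months(year, month, delta):
    """Return (year, month) shifted by delta months."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def slide_window(partitions, today, keep_months):
    """Split YYYYMM partition keys into (keep, archive) for a monthly window.

    The window covers the current month plus the keep_months - 1 before it.
    """
    oldest_year, oldest_month = add_months(today.year, today.month, -(keep_months - 1))
    cutoff = oldest_year * 100 + oldest_month
    current = today.year * 100 + today.month

    keep = sorted(p for p in partitions if p >= cutoff)
    archive = sorted(p for p in partitions if p < cutoff)
    if current not in keep:
        keep.append(current)  # open a partition for the incoming month
    return keep, archive


if __name__ == "__main__":
    existing = [201211, 201212, 201301, 201302, 201303, 201304]
    keep, archive = slide_window(existing, date(2013, 5, 7), keep_months=3)
    print("keep:", keep)
    print("archive:", archive)
```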

Page 16: SQL Server Big Data Analytics Features

‣ Columnstore
‣ Sqoop adapter
‣ PolyBase
‣ Hive
‣ In-memory analytics
‣ Scale-out MPP

Page 17: Microsoft’s Data Solution – Big Data & PDW

[Architecture diagram. Labels: Sensors, Devices, Bots, Crawlers, ERP, CRM, LOB, Apps; Unstructured and Structured Data; Parallel Data Warehouse; Hadoop on Windows Azure; Hadoop on Windows Server; Connectors; SSRS; SSAS; BI Platform; Familiar End User Tools; Excel with PowerPivot; Embedded BI; Predictive Analytics; Data Market Place; Data Market; Petabytes of Data (Unstructured); Hundreds of TB of Data (Structured)]

Page 18: MICROSOFT BIG DATA

[Diagram. Labels: immersive data experiences; connecting with the world's data; any data, any size, anywhere; Discover, Combine, Refine; Relational, Non-relational, Analytical, Streaming; Self-Service, Collaboration, Corporate Apps, Devices; Parallel Data Warehouse; Microsoft HDInsight Server; HDInsight Service; StreamInsight; PowerPivot; Power View]

Page 19

Windows Azure HDInsight Service

Microsoft HDInsight Server

Expanded Partnership

Page 20: Microsoft .NET Hadoop APIs

‣ WebHDFS
‣ LINQ to Hive
‣ MapReduce
  ‣ C#
  ‣ Java
‣ Hive
‣ Pig
‣ http://hadoopsdk.codeplex.com/
‣ SQL on Hadoop
  ‣ Cloudera Impala
  ‣ Teradata SQL-H
  ‣ Microsoft PolyBase
  ‣ Hadapt

Page 21: Data Movement to the Cloud

‣ Use Windows Azure Blob Storage
  • Already stored in 3 copies
  • Hadoop can read from Azure blob storage
  • Allows you to upload while using no Hadoop network or CPU resources
‣ Compress files
  • Hadoop can read Gzip
  • Uses less network resources than uncompressed
  • Lowers direct storage costs
  • Compress directories where source files are created as well
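
The compression advice is easy to check locally before uploading anything. The sketch below gzips a block of CSV text with Python's standard library and reports the size ratio. The sample data is a made-up repetition of one sales row, so the ratio it prints is far better than real log or sales files would give.

```python
import gzip


def gzip_bytes(data: bytes) -> bytes:
    """Compress a payload in gzip format, the same format as a .gz file."""
    return gzip.compress(data)


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(gzip_bytes(data)) / len(data)


if __name__ == "__main__":
    # Highly repetitive sample data, for illustration only.
    sample = b"Kromer,123,5,55\n" * 10000
    packed = gzip_bytes(sample)
    print("original bytes:  ", len(sample))
    print("compressed bytes:", len(packed))
    print("ratio:           ", round(compression_ratio(sample), 4))
```

One design point to keep in mind: a gzip file cannot be split, so Hadoop reads each `.gz` file with a single mapper. Uploading many moderately sized compressed files usually parallelizes better than one very large one.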


Page 22: Wrap-up

‣ What is a Big Data approach to Analytics?
  ‣ Massive scale
  ‣ Data discovery & research
  ‣ Self-service
  ‣ Reporting & BI
‣ Why do we take this Big Data Analytics approach?
  ‣ TBs of change data in each subject area
  ‣ The data in the sources are variable and unstructured
  ‣ SSIS ETL alone couldn’t keep up or handle complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL Server for Big Data
  ‣ With the configs mentioned previously, SQL Server works great
‣ Analytics on Big Data also requires Big Data Analytics tools
  ‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse

