Sources: The Economist, Feb ‘10; IDC
By 2016 the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days, more than Sloan acquired in 10 years.
In 2000 the Sloan Digital Sky Survey collected more data in its first week than had been collected in the entire prior history of astronomy.
The Large Hadron Collider at CERN generates 40 terabytes of data every second
Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes.
The Twitter community generates over 1 terabyte of tweets every day
Bing ingests more than 7 petabytes a month.
1,800,000,000,000,000,000,000 bytes (1.8 ZB): the size of the Digital Universe in 2011.
Within 24 months the number of intelligent devices will exceed the number of traditional IT devices.
Sources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast
In 2015 nearly 20% of the information will be touched by the cloud.
Financial Services: modeling true risk, threat analysis, fraud detection, trade surveillance, credit scoring and analysis
Retail: point-of-sale transaction analysis, customer churn analysis, sentiment analysis
E-Commerce: recommendation engines, ad targeting, search quality, abuse and click-fraud detection
Telecommunications: customer churn prevention, network performance optimization, call detail record (CDR) analysis, analyzing network data to predict failure
So how does it work? Second, take the processing to the data.
// Map and Reduce functions in JavaScript
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
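The runtime wires these two functions into the cluster, but the data flow itself is easy to simulate locally. The sketch below runs the same word-count logic over two input "splits" in plain JavaScript; the helper names (`mapWords`, `reduceCounts`) are illustrative stand-ins for the framework's plumbing, not part of any Hadoop API.

```javascript
// Emit (word, 1) for every word in a line, as the map function above does.
function mapWords(value, emit) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      emit(words[i].toLowerCase(), 1);
    }
  }
}

// Group emitted pairs by key and sum the counts per key, which is what
// the reduce function does for each key's list of values.
function reduceCounts(pairs) {
  var totals = {};
  pairs.forEach(function (p) {
    totals[p.key] = (totals[p.key] || 0) + p.count;
  });
  return totals;
}

// Simulate a run over two input splits.
var emitted = [];
["Hello Hadoop", "hello world"].forEach(function (line) {
  mapWords(line, function (key, count) {
    emitted.push({ key: key, count: count });
  });
});
var result = reduceCounts(emitted);
// result: { hello: 2, hadoop: 1, world: 1 }
```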
Hadoop in detail
• Analysis of semi-structured and unstructured data distributed across a commodity cluster
• Based on Google’s MapReduce paper and the Google File System (GFS)
• Programs = a sequence of “map” and “reduce” tasks
• Simplifies writing distributed applications
• Highly fault tolerant: multiple copies of the data
• Moves computation close to the data
• Implemented in Java and optimized for Linux
            Traditional RDBMS           MapReduce
Data size   Gigabytes (terabytes)       Petabytes (exabytes)
Access      Interactive and batch       Batch
Updates     Read/write many times       Write once, read many times
Structure   Static schema               Dynamic schema
Integrity   High (ACID)                 Low
Scaling     Nonlinear                   Linear
DBA ratio   1:40                        1:3000
Hadoop Ecosystem
Hadoop = MapReduce + HDFS
• MapReduce (job scheduling/execution system)
• HDFS (Hadoop Distributed File System)
• HBase / Cassandra (columnar NoSQL databases)
• Pig (data flow)
• Hive (warehouse and data access)
• Oozie (workflow)
• Sqoop
• Flume
• Avro (serialization)
• ZooKeeper (coordination)
• Apache Mahout
• Karmasphere (development tool)
• Traditional BI tools on top
Hadoop + Microsoft
Our own distribution of Hadoop
• Submit changes back to the Apache Foundation
• Free to download
Optimized for Windows & Azure
• AD & System Center integration
• Hadoop as a service on Azure
Focus on .NET developers
• Integration with Visual Studio
• Support for C#
Differentiation through enterprise readiness
• Performance and scale
• High availability
• Ease of use
Why Hadoop as a service?
• Task-based billing
• Easy administration
• Zero install
• Support for a wide variety of job types: machine learning (Mahout), graph mining (Pegasus), Hive, Pig, Java, JavaScript, etc.
• Greatly simplified UI
UNIX pipes:
    cat [input_file] | [mapper] | sort | [reducer] > [output_file]

Hadoop Streaming:
    hadoop jar lib\hadoop-streaming.jar
        -input directory
        -output directory
        -mapper [any script or executable]
        -reducer [any script or executable]
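Under Hadoop Streaming the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout; the framework sorts mapper output by key before the reducer sees it. The sketch below mimics that contract for word count with in-memory arrays in place of real stdin/stdout; the function names are illustrative, not part of any Hadoop API.

```javascript
// Mapper: emit "word\t1" for every word, mirroring a mapper script's stdout.
function mapperLines(inputLines) {
  var out = [];
  inputLines.forEach(function (line) {
    line.split(/[^a-zA-Z]/).forEach(function (w) {
      if (w !== "") out.push(w.toLowerCase() + "\t1");
    });
  });
  return out;
}

// Reducer: because the framework sorts by key, all lines for one key
// arrive consecutively, so summing each run of identical keys suffices.
function reducerLines(sortedLines) {
  var out = [], currentKey = null, sum = 0;
  sortedLines.forEach(function (line) {
    var parts = line.split("\t");
    if (parts[0] !== currentKey) {
      if (currentKey !== null) out.push(currentKey + "\t" + sum);
      currentKey = parts[0];
      sum = 0;
    }
    sum += parseInt(parts[1], 10);
  });
  if (currentKey !== null) out.push(currentKey + "\t" + sum);
  return out;
}

// The framework's shuffle/sort step is simulated with a plain array sort.
var sorted = mapperLines(["to be or not to be"]).sort();
var counts = reducerLines(sorted);
// counts: ["be\t2", "not\t1", "or\t1", "to\t2"]
```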
Data Market integration
Key features and benefits:
• Integration with third-party data and services
• Mashing up of internal and public data sets
• Integration with Windows Azure Marketplace through OData
• Sharing of data and insights through Windows Azure Marketplace
Some other fancy stuff...
Microsoft Codename "Social Analytics"
Key features and benefits:
• Integration of social media data with business applications
• Integration with social media sites
• Models augmented with publicly available data from social media sites
• Stronger customer relationships
[Platform overview diagram: data management across non-relational, relational, multidimensional, and streaming sources; data enrichment via share and govern, discover and recommend, transform and clean, marketplace, and external data and services; analytics that are operational, self-service, mobile, predictive, real-time, and collaborative]
Reality check A.D. 2012
Stack: Microsoft BI tools on top of a 24 TB cube fed by the Hadoop distribution
Use case:
• Extremely large volume of unstructured web log analysis
• Ad hoc analysis of unstructured web logs to prototype patterns
• Hadoop data feeds the large 24 TB cube