Date post: | 22-Nov-2014 |
Category: |
Internet |
Upload: | giivee-the |
Agenda
• Big data will change the world?
• What is Spark?
• Demo (Start a Spark cluster / Word Count)
• Break: 10 min
• Spark Concept
• Demo (ETL / MLlib)
• Q&A
Who am I?
• Wisely Chen
• Sr. Engineer in Yahoo![Taiwan] data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Spark Summit 2014 San Francisco
• COSCUP 2006, 2012, 2013; OSDC 2007; Webconf 2013; PHPConf 2012; RubyConf 2012
Taiwan Data Team
Data Highway
BI Report
Serving API
Data Mart
ETL / Forecast
Machine Learning
Big data will change the world?
Sensor
Data
Machine Learning
Robot
More Sensor
Data
Machine Learning
Robot
Human Action Data
Machine Learning
Robot
What is a sensor?
Internet of Things
More Sensor
• 2000: Browser; Digital Camera (Photo)
• 2000~2014: Browser; Mobile (GPS, More Photo, Video); Wearable Device (Pulse); Google Glass (More Video); Internet of Things (………)
More Sensor
Bigger Data
Machine Learning
Robot
Technology Improves
• The Sloan Digital Sky Survey (SDSS) collected more data in its first few weeks than had been amassed in the entire history of astronomy.
• The Large Synoptic Survey Telescope in Chile, due to come on stream in 2016, will acquire that quantity of data every five days.
New Area
A 30-minute zebrafish experiment = 1 TB: http://research.janelia.org/zebrafish/
Hadoop handles big data well
• 18M Hadoop-related jobs on the Yahoo Grid
• Yahoo handles over 440 PB of data daily
• Most jobs are ETL/SQL/BI
eBay's data volume: 130 EB in 2015; 4000 ZB in 2020
Source: Vadim Kutsyy, "Data Science Empowering Personalization", Big Data Innovation Summit 2014, Boston
Data is not only bigger
• We have more areas of data
• More sensors
• Sensor technology improves
• New areas
More Sensor
Bigger Data
Better Machine Learning
Robot
Word Grammar Check
• MS researchers Michele and Eric tried to improve a grammar-check algorithm
• They took four algorithms and fed in 10M, 100M, and 1B words
• With 10M words, the sophisticated algorithm (86%) works better than the simpler algorithm (75%)
• With 1B words, the simpler algorithm (95%+) improved a lot, even beating the sophisticated algorithm (94%)
"Simple models and a lot of data trump more elaborate models based on less data"
- Google AI guru Peter Norvig, "The Unreasonable Effectiveness of Data"
In the translation area
Different types of data
• In a Harvard data mining class, two teams took on the Netflix recommendation challenge
• Team A came up with a very sophisticated algorithm using the Netflix data
• Team B used a very simple algorithm, but added data beyond the Netflix set
• Team B got much better results, close to the best results on the Netflix leaderboard
Taiwan Shopping User Analysis
Men tend to view underwear, but they don't buy it
Male Users' Top 5 View Categories
1. Computer
2. Camera
3. ……….
4. ……….
5. Women's Underwear
2 types of data
• Traffic Data: users' views / clicks; "weak intention"; large amount
• Transaction Data: users' checkouts; "strong intention"; small amount
Small Data
Sophisticated Algo
OK Result
Big Data
Simple Algo
Better Result
Data Set 1
Smart model which leverages
more areas of data
Data Set 2
Best Result
More Sensor
Bigger Data
Better Machine Learning
Helpful Robot
Robot
• Foxconn's robots can replace 70% of workers
• Google driverless car / BigDog
• Amazon warehouse robots
More Sensor
Bigger Data
Better Machine Learning
Helpful Robot
It is not a movie; it is happening
Sensor: User behavior
Data
Machine Learning: Recommendation algorithm
Robot: Recommendation to user (1/3 of sales are from the recommendation module)
Amazon laid off its editorial team and replaced it with a recommendation algorithm
Sensor: Weather, humidity, sun, ….
Data
Machine Learning
Robot: Give more water to area A
Sensor
Data
Machine Learning
Robot
DNNresearch, Behavio, Wavii, Flutter, autofuss, DeepMind,
spider.io, Adometry, Quest Visual, Jetpac,
Talaria, Stackdriver, SCHAFT, Industrial Perception, Redwood Robotics,
Meka Robotics, Holomni, Bot & Dolly,
Boston Dynamics, Titan Aerospace,
Nest Labs, MyEnergy, Skybox Imaging, Dropcam
Google bought 47 companies in 2013 and 2014
IoT
Google is the top leader in big data
24 of those companies are on the ring
Sensor
Data
Machine Learning
Robot
Sensor
Data
Machine Learning
Robot
Be part of it!!!
The ring will change the world, and
big data is the core of the ring
Sensor
Data
Machine Learning
Robot
Hadoop
Sensor
Data
Machine Learning
Robot
Hadoop
Not so well
Hadoop is not good at machine learning
• Efficiency
• Difficult
• Data Engineer
• Data Scientist
• Data Analyst
• Algorithm: it is not so easy to parallelize your algorithm
Sensor
Data
Machine Learning
Robot
Hadoop is not good at machine learning
• Efficiency
• Difficult: data scientists don't know how to use it
• Algorithm: it is not so easy to parallelize your algorithm
Spark is 3X~25X faster than the MapReduce framework!
From Matei's paper: http://0rz.tw/VVqgP
Running time (s), MapReduce (MR) vs Spark:
Logistic regression: MR 76, Spark 3
KMeans: MR 106, Spark 33
PageRank: MR 171, Spark 23
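The "3X~25X" range can be checked against the chart values with a little arithmetic. The sketch below assumes the running times read off the chart are MR 76 / Spark 3 for logistic regression, MR 106 / Spark 33 for KMeans, and MR 171 / Spark 23 for PageRank; the variable names are only for illustration.

```python
# Running times in seconds, as read off the chart (an assumption about
# which bar belongs to which framework: the larger value is MapReduce).
running_time = {
    "Logistic regression": {"MR": 76, "Spark": 3},
    "KMeans": {"MR": 106, "Spark": 33},
    "PageRank": {"MR": 171, "Spark": 23},
}

# Speedup = MapReduce time divided by Spark time.
speedups = {name: t["MR"] / t["Spark"] for name, t in running_time.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

The smallest factor comes from KMeans and the largest from logistic regression, which matches the range quoted on the slide.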
Hadoop is not good at machine learning
• Efficiency
• Difficult
• Algorithm: it is not so easy to parallelize your algorithm
Data Analyst
Data Engineer
Data Scientist
Language Support
• Python: Data Scientist, Data Engineer
• Java: Data Engineer
• Scala: Data Engineer
• SQL: Data Scientist, Data Analyst, Data Engineer
• R: Data Scientist, Data Analyst (will be officially supported in 1.2)
Python Word Count
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Access data via Spark API
Process via Python
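To see what the three steps of the word count do without a cluster, here is a plain-Python sketch that mimics flatMap, map, and reduceByKey on an in-memory list. It does not use Spark at all; the `word_count` function and the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    # flatMap: split every line into words and flatten into one stream
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


sample = ["to be or", "not to be"]
result = word_count(sample)
print(result)
```

In Spark the same three steps run in parallel over partitions of the file, and reduceByKey also shuffles pairs so that equal words meet on the same worker.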
Scala Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Java Word Count
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
// mapToPair (Spark 1.0+) builds the (word, 1) pairs
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
Highly Recommended
• Scala: latest API features, stable
• Python: a very familiar language; native libs: NumPy, SciPy
What is Spark
• From UC Berkeley AMP Lab
• Apache Spark™ is a very fast and general engine for large-scale data processing
• The most active big data open source project since Hadoop
Community
Where is Spark?
HDFS
YARN
MapReduce
Hadoop 2.0
Storm HBase Others
HDFS
YARN
MapReduce
Hadoop Architecture
Hive
Storage
Resource Management
Computing Engine
SQL
HDFS
YARN
MapReduce
Hadoop vs Spark
Spark
Hive Shark/SparkSQL
More than MapReduce
HDFS
Spark Core: MapReduce
SparkSQL: Hive
GraphX: Pregel
MLlib: Mahout
Streaming: Storm
Resource Management System(Yarn, Mesos)
How to use it?
• 1. go to https://spark.apache.org/
• 2. Download and unzip it
• 3. ./sbin/start-all.sh or ./bin/spark-shell
EC2
• ./ec2/spark-ec2 -k xxx -i xxx -s 3 launch CLUSTERNAME
• http://spark.apache.org/docs/latest/ec2-scripts.html
DEMO
Python Word Count
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
BREAK
Spark Concept
Why is Spark so fast?
Most machine learning algorithms need iterative computing
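[Figure: PageRank on a four-node graph (a, b, c, d) over three iterations. 1st iteration: every rank is 1.0. 2nd iteration: ranks are 1.85 (a), 1.0, 0.58, 0.58. 3rd iteration: ranks are 1.31 (a), 1.72, 0.39, 0.58. Each iteration turns the current ranks into a temporary result that feeds the next one.]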
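The iterative pattern in the PageRank figure can be sketched in a few lines of plain Python. The link structure below is an assumption: the slide text does not preserve the edges, so this graph was chosen only because it gives rank values consistent with the ones shown (1.85 and 1.31 for node a). The update rule `0.15 + 0.85 * contributions` follows the common simplified PageRank example and is likewise an assumption about what the slide used.

```python
def pagerank_step(links, ranks, damping=0.85):
    """One PageRank iteration: every node splits its rank among its out-links."""
    contribs = {node: 0.0 for node in links}
    for node, outs in links.items():
        share = ranks[node] / len(outs)
        for dest in outs:
            contribs[dest] += share
    return {node: (1 - damping) + damping * c for node, c in contribs.items()}


# Hypothetical four-node graph (node -> list of out-links).
links = {"a": ["b"], "b": ["a", "d"], "c": ["a"], "d": ["a", "c"]}

ranks = {node: 1.0 for node in links}   # 1st iteration: every rank is 1.0
history = [ranks]
for _ in range(2):                      # compute the 2nd and 3rd iterations
    ranks = pagerank_step(links, ranks)
    history.append(ranks)

for i, r in enumerate(history, start=1):
    print(i, r)
```

Every pass needs the complete output of the previous pass, which is why the place where that intermediate result lives (HDFS or memory) dominates the running time.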
HDFS is 100x slower than memory
MapReduce: Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> Iter N
Spark: Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> Iter N
PageRank on 1 billion URL records
1st iteration (HDFS) takes 200 sec
2nd iteration (mem) takes 7.4 sec
3rd iteration (mem) takes 7.7 sec
Memory Size Problem
Cache storage in local disk(2sec)
Cache storage in memory(2sec)
Network transfer(30 sec)
Just memory?
• From Matei's paper: http://0rz.tw/VVqgP
• HBM: stores data in an in-memory HDFS instance
• SP: Spark
• HBM'1, SP'1: first run
• Storage: HDFS with 256 MB blocks
• Node information: m1.xlarge EC2 nodes, 4 cores, 15 GB of RAM
• 100 GB of data on a 100-node cluster
Running time (s):
Logistic regression: HBM'1 139, HBM 62, SP'1 46, SP 3
KMeans: HBM'1 182, HBM 87, SP'1 82, SP 33
Map Reduce
Input (HDFS) -> map, map, map -> Shuffle -> reduce, reduce -> Output (HDFS)
Map Reduce
Map (Narrow): map, filter; union; join with inputs co-partitioned
Reduce (Wide): groupBy on non-partitioned data; join with inputs not co-partitioned
DAG Engine
groupBy
map
union
join
Hadoop (4 MR)
Operations: groupBy, map, union, join
Jobs: MR1, MR2, MR3, MR4
Spark (2 MR, 1 map)
Operations: groupBy, map, union, join
Jobs: MR1, Map, MR2
MapReduce: Input (HDFS) -> MR1 -> Tmp (HDFS) -> MR2 -> Tmp (HDFS) -> MR3 -> Tmp (HDFS) -> MR4 -> Output (HDFS)
Spark: Input (HDFS) -> MR1 -> Tmp (MEM) -> MAP -> Tmp (MEM) -> MR4 -> Output (HDFS)
CACHE
Stage 1
Stage 2
groupBy
map
union
join
Stage 2
RDD
• Resilient Distributed Dataset
• Interface of data, stored in RAM or on Disk
• Built through parallel transformations
RDD
RDD a: val a = sc.textFile("hdfs://....")
RDD b: val b = a.filter(line => line.contains("Spark"))  (Transformation)
Value c: val c = b.count()  (Action)
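The split between transformations and actions can be illustrated with a toy class in plain Python. This is only a sketch of the idea (a transformation records work, an action triggers it); `LazyDataset` and `load_lines` are hypothetical names and this is not how Spark implements RDDs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: filters are recorded, not run, until an action."""

    def __init__(self, source, predicates=()):
        self._source = source              # callable that returns the records
        self._predicates = tuple(predicates)

    def filter(self, predicate):
        # Transformation: returns a new dataset, nothing is read yet
        return LazyDataset(self._source, self._predicates + (predicate,))

    def count(self):
        # Action: the source is read now and every recorded filter is applied
        return sum(
            1 for item in self._source()
            if all(p(item) for p in self._predicates)
        )


reads = []  # grows by one each time the source is actually read


def load_lines():
    reads.append(1)
    return ["Spark is fast", "Hadoop", "Spark SQL"]


a = LazyDataset(load_lines)
b = a.filter(lambda line: "Spark" in line)   # transformation, still lazy
reads_before_action = len(reads)
c = b.count()                                # action, triggers the read
print(reads_before_action, c, len(reads))
```

Deferring the work until an action is what lets an engine look at the whole chain of transformations before deciding how to run it.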
Log mining
a = sc.textFile("hdfs://aaa.com/a.txt")
err = a.filter(lambda t: "ERROR" in t) \
       .filter(lambda t: "2014" in t)

err.cache()
err.count()

m = err.filter(lambda t: "MYSQL" in t) \
       .count()
a = err.filter(lambda t: "APACHE" in t) \
       .count()
Driver
Worker: Task
Worker: Task
Worker: Task
Log mining
Driver
Worker: Block1, RDD a
Worker: Block2, RDD a
Worker: Block3, RDD a
Log mining
Driver
Worker: RDD err, Block1
Worker: RDD err, Block2
Worker: RDD err, Block3
Log mining
Driver
Worker: RDD err, Cache1
Worker: RDD err, Cache2
Worker: RDD err, Cache3
Log mining
Driver
Worker: RDD m, Cache1
Worker: RDD m, Cache2
Worker: RDD m, Cache3
Log mining
Driver
Worker: RDD a, Cache1
Worker: RDD a, Cache2
Worker: RDD a, Cache3
1st iteration (no cache) takes the same time
With cache, it takes 7 sec
RDD Cache
• Data locality
• Cache
Self-join on 5 billion records: a big shuffle takes 20 min; after cache, it takes only 265 ms
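Why the cached runs are so much cheaper can be shown with a toy model of the log mining example in plain Python. The `Dataset` class, the `read_log` function, and the sample log lines are all hypothetical, and the sketch ignores distribution entirely; it only counts how often the source is scanned.

```python
class Dataset:
    """Toy stand-in for an RDD with an optional in-memory cache."""

    def __init__(self, compute):
        self._compute = compute          # callable that produces the records
        self._cache_enabled = False
        self._cached = None

    def filter(self, predicate):
        # The child recomputes from its parent unless the parent is cached
        return Dataset(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached          # served from memory, no rescan
        records = list(self._compute())
        if self._cache_enabled:
            self._cached = records
        return records

    def count(self):
        return len(self.collect())


scans = []  # grows by one for every full read of the "file"


def read_log():
    scans.append(1)
    return [
        "2014 ERROR MYSQL connection lost",
        "2014 INFO started",
        "2014 ERROR APACHE timeout",
        "2013 ERROR MYSQL old entry",
    ]


err = Dataset(read_log).filter(lambda t: "ERROR" in t).filter(lambda t: "2014" in t)
err.cache()
err_count = err.count()                                      # first run scans the source
mysql_count = err.filter(lambda t: "MYSQL" in t).count()     # reuses the cache
apache_count = err.filter(lambda t: "APACHE" in t).count()   # reuses the cache
print(err_count, mysql_count, apache_count, len(scans))
```

Without the `err.cache()` call, each of the three counts would rescan the source, which is the behavior the slide contrasts with the cached timings.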
DEMO
Log Mining
Page Rank
SparkSQL
Recommendation
MLlib
• Data
• data: [(36, 2802, 4.0), (36, 256, 4.0), …]
• rank and numIter are ints; lambda_ is the regularization parameter (lambda itself is a reserved word in Python)
• candidates: [(0, 2), (0, 3), (0, 4), …]
• model = ALS.train(data, rank, numIter, lambda_)
• model.predictAll(candidates)
Homework
• 1. Install Spark and run word count (50%)
• Data: http://www.gutenberg.org/ebooks/5000
• Output: total number of words
• 2. Write a movie recommendation (50%)
• Training data: http://arbor.ee.ntu.edu.tw/~wisely/data/lesson.tgz
• Input: 10 ratings (1-5) on 10 movies
• Example: movie 123 is rated 3, movie 45 is rated 5
• Output: top 10 recommended movies
• Any algorithm is OK
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
Background Knowledge
• Real-time tweet data is stored in a SQL database
• Spark MLlib uses Wikipedia data to train a TF-IDF model
• SparkSQL selects tweets and filters them with the TF-IDF model
• Generate a live BI report
Code
val wiki = sql("select text from wiki")
val model = new TFIDF()
model.train(wiki)
registerFunction("similarity", model.similarity _)
select tweet from tweet where similarity(tweet, "$search") > 0.01
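The slide does not show what the TF-IDF model does internally, so here is a minimal plain-Python sketch of one way a `similarity` function like the one registered above could work: train inverse document frequencies on a corpus, then score two texts by the cosine of their TF-IDF vectors. The `TfIdf` class, the smoothing formula, and the sample texts are all assumptions for illustration, not the code used in the demo.

```python
import math
from collections import Counter


class TfIdf:
    """Hypothetical TF-IDF model with a cosine similarity method."""

    def train(self, docs):
        n_docs = len(docs)
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc.lower().split()))
        # Smoothed inverse document frequency, so no weight is ever zero
        self.idf = {
            term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()
        }
        # Weight used for terms that never appeared in the training corpus
        self.default_idf = math.log(1 + n_docs) + 1

    def vector(self, text):
        term_freq = Counter(text.lower().split())
        return {t: c * self.idf.get(t, self.default_idf) for t, c in term_freq.items()}

    def similarity(self, a, b):
        va, vb = self.vector(a), self.vector(b)
        dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
        norm_a = math.sqrt(sum(w * w for w in va.values()))
        norm_b = math.sqrt(sum(w * w for w in vb.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)


# Stand-in for the Wikipedia text used for training
wiki = [
    "spark is a fast engine for big data",
    "hadoop stores big data in hdfs",
    "the weather is sunny today",
]
model = TfIdf()
model.train(wiki)

tweets = ["spark makes big data fast", "lunch was great today"]
search = "spark big data"
# Same filter as the SQL on the slide: keep tweets scoring above 0.01
matched = [t for t in tweets if model.similarity(t, search) > 0.01]
print(matched)
```

Training on a large general corpus means common words get low weights, so the filter responds mostly to the distinctive terms a tweet shares with the search string.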
DEMO
http://youtu.be/dJQ5lV5Tldw?t=39m30s
Q & A