IBA-FCS: Introductory Data Analytics with Apache Spark
By Danyal, Baqir and Shoaib
5/25/2015
Table of Contents
Setting Up Network of Oracle VirtualBox
Setup Spark on Windows 7 Standalone Mode
Setup Spark on Linux
Spark Features
Spark Standalone Cluster Mode
Setup Master Node
Starting Cluster Manually
Starting & Connecting Worker to Master
Submitting Applications to Spark
Interactive Analysis with the Spark Shell
Spark Submit
Custom Code Execution & HDFS File Reading
Setting up network of Oracle VirtualBox:
1. When starting the VM, set the "Attached to" value to Bridged Adapter, so that it can connect to other VMs on the network, including those residing on different host machines.
2. Refresh the MAC address by clicking the Refresh icon, so that every VM has a different MAC and is assigned a unique IP.
Figure 1: VM settings to follow when creating the network
3. After applying the above settings and starting the VM, make sure you are able to ping from host to VM and vice versa.
Setup SPARK on Windows 7 Standalone Mode:
Prerequisites:
Java 6+
Scala 2.10
Python 2.6+
Spark 1.2.x
sbt (in case you build Spark from source)
Git (if you use the sbt tool)
Environment Variables:
Set JAVA_HOME and add it to the PATH environment variable.
Download and install Scala 2.10.
Set SCALA_HOME and add %SCALA_HOME%\bin to the PATH environment variable. To test
whether Scala is installed or not, run the following command.
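A sketch of this setup from a Windows command prompt is shown below; the install paths are examples only, so adjust them to your own machine:

```
:: Example paths -- substitute your actual Java and Scala install locations
setx JAVA_HOME "C:\Program Files\Java\jdk1.7.0"
setx SCALA_HOME "C:\scala"
setx PATH "%PATH%;%JAVA_HOME%\bin;%SCALA_HOME%\bin"

:: Open a NEW command prompt (setx only affects new sessions), then verify:
java -version
scala -version
```

If both commands print a version string, the environment variables are set correctly.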
Downloading & Setting up Spark:
Choose a Spark prebuilt package for Hadoop, e.g. "Pre-built for Hadoop 2.3/2.4 or later". Download and
extract it to any drive, e.g. D:\spark-1.2.1-bin-hadoop2.3.
Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH environment variable.
Download winutils.exe and place it in any location (e.g. D:\winutils\bin\winutils.exe) to avoid
Hadoop errors.
Set HADOOP_HOME = D:\winutils in the environment variables.
Now run the command "spark-shell"; you'll see the Scala shell.
For the Spark UI, open http://localhost:4040/ in a browser.
Press Ctrl + Z to get out of the shell once it starts successfully.
To test the setup, you can run one of the bundled examples. If all goes well, the sample program
executes and returns its result on the console.
And that's how you set up Spark on Windows 7.
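One of the bundled examples can be used for this test; a minimal sketch, assuming the standard Spark 1.2.x package layout where the example launcher lives under %SPARK_HOME%\bin:

```
rem Compute an approximation of Pi using 10 partitions
%SPARK_HOME%\bin\run-example SparkPi 10
```

The job should finish with a line like "Pi is roughly 3.14..." printed on the console.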
Note: the master URL has the form spark://IP:7077; the master web UI itself is accessed at http://IP:8080.
Setup SPARK on Linux: Download the Hortonworks Sandbox, with Spark installed and configured, from the following link:
http://hortonworks.com/hdp/downloads/
Choose HDP 2.2.4 on Sandbox.
You can find Spark in this directory:
o /usr/hdp/2.2.4.2-2/spark/bin
You are now ready to go, as Hortonworks has this set up for you pre-packaged.
SPARK Features: Speed - Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory
computing.
Ease of Use - Write apps quickly in Java, Scala or Python. Spark offers more than 80 high-level operators that
make it easy to build parallel apps.
Generality - Combines SQL, streaming and complex analytics, powering a stack of high-level tools
including Spark SQL, MLlib (for machine learning), GraphX and Spark Streaming.
Runs Everywhere - Spark runs on Hadoop, in standalone mode, or in the cloud, and can access diverse
data sources including HDFS, Cassandra, HBase and S3.
SPARK Standalone Cluster Mode:
There are two other Spark deployment modes, YARN and Mesos, but here we will only talk about
standalone mode.
Spark is already installed in standalone mode on your Hortonworks VM node. Simply acquire a
pre-built version of Spark for any future installations.
Setup Master Node
1. Edit the /etc/hosts file with the vi editor.
2. Edit the file so that it maps each hostname to its IP; here we use 2 slaves, which must be set up
the same way as this VM, and the same entries must be made in the /etc/hosts file on the
slaves too.
3. Change the master machine's hostname to master and the slave machines' hostnames to
slave1 and slave2 respectively.
4. On the master machine there is one more step to perform: create a file called
conf/slaves in your Spark directory, which must contain the hostnames of all the
machines where you intend to start Spark workers, one per line. If conf/slaves does not
exist, the launch scripts default to a single machine (localhost), which is useful only for
testing.
5. If everything has gone according to plan, your multi-node Spark cluster is up and
running, and you can verify it by pinging the other machines from every machine. In our
case there were 3 machines (1 master, 2 slaves).
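As an illustration of steps 1 to 4 (the IP addresses below are invented for this sketch, not values from our cluster), the two files might look like this:

```
# /etc/hosts -- identical entries on master and both slaves
192.168.1.10  master
192.168.1.11  slave1
192.168.1.12  slave2

# conf/slaves -- on the master only, inside the Spark directory,
# one worker hostname per line
slave1
slave2
```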
Starting Cluster Manually
You can start a standalone master server by executing the following:
./sbin/start-master.sh
Once started, the master prints out its master URL, of the form
spark://HOST:PORT, which is used to connect workers to it.
This URL can also be found on the web UI, whose default address is http://localhost:8080.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in
the conf/slaves file.
sbin/start-all.sh - Starts both a master and a number of slaves as described above.
sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
sbin/stop-slaves.sh - Stops all slave instances on the machines specified in
the conf/slaves file.
sbin/stop-all.sh - Stops both the master and the slaves as described above.
Starting & Connecting Worker to Master
You can start a worker and connect it to the master via this command:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://HOST:PORT
Once you have started a worker, look at the master's web UI (http://localhost:8080 by
default). You should see the new node listed there, along with its number of CPUs and
memory (minus one gigabyte left for the OS).
Once you have connected the workers to the master successfully (check by browsing the web UI,
where the slave machines will be shown), any task you submit to the master will be distributed
across all the available machines in the network and worked on in parallel.
Figure 2: Master Web UI
Submitting Applications to Spark
We have 2 ways to do this.
Interactive Analysis with the Spark Shell
Pyspark:
./bin/pyspark --master spark://IP:PORT
Spark shell:
./bin/spark-shell --master spark://IP:PORT
There are many parameters that can be passed with the above commands, links for which are given at the
end of this document. Running the pyspark or spark-shell command will open an interactive
shell for you to work in, writing code line by line and pressing enter, for example as below:
textFile = sc.textFile("README.md")
textFile.count() # Number of items in this RDD
textFile.first() # First item in this RDD
Here 'sc' is the SparkContext object, which is made available by Spark when you run either the
pyspark or spark-shell command. Behind the scenes, spark-shell invokes the more
general spark-submit script.
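The shell launchers also accept the standard spark-submit resource flags; as a hedged sketch (the master hostname and resource sizes below are examples, not values from this setup):

```
# Launch pyspark against the standalone master, capping the resources
# each executor may claim from the cluster
./bin/pyspark --master spark://master:7077 \
  --executor-memory 2g \
  --total-executor-cores 4
```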
Spark Submit
Once a user application is bundled, it can be launched using the bin/spark-submit script. This
script takes care of setting up the classpath with Spark and its dependencies, and can support
different cluster managers and deploy modes that Spark supports:
As a simple example, from inside the Spark folder, execute something like this:
./bin/spark-submit --master spark://master:7077 k-means.py
Here --master specifies where to submit the app, followed by the name of the file to run, whether
Scala, Java or Python.
Template for the spark-submit command:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Some of the commonly used options are:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://hostname-or-IP:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an
external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain
spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL
must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any
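Putting these options together, a fuller invocation might look like the sketch below; the examples jar path and the final argument (number of partitions for SparkPi) are hypothetical placeholders, so substitute your own bundled application:

```
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master:7077 \
  --deploy-mode client \
  --conf "spark.executor.memory=2g" \
  lib/spark-examples-1.2.1-hadoop2.3.0.jar \
  100
```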
Custom code execution & HDFS File reading
To read a text file inside your Python or Scala program when running on the cluster, the file should
reside inside HDFS so that every node can access it; only then will sc.textFile("file.txt") be able to
read it and access its content. See the sample commands below to create an HDFS directory and put
a file in it.
hadoop fs -mkdir hdfs://master/user/hadoop/spark/data/
Creates a directory inside HDFS at the given path URI.
hadoop fs -ls hdfs://master/user/hadoop/spark/data/
Lists the directory contents inside HDFS at the given path URI.
hadoop fs -put home/sample.txt hdfs://master/user/hadoop/spark/data/
Uploads a file onto HDFS from a local Linux directory (home/ in this case).
hadoop fs -get hdfs://master/user/hadoop/spark/data/sample.txt home/
Downloads a file from the HDFS directory to a local Linux directory (home/ in this case).
Sample code is given below which reads a file from an HDFS directory:

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext
import time

start_time = time.time()

sc = SparkContext(appName="K means")
# Load and parse the data
data = sc.textFile("hdfs://master/user/hadoop/spark/data/kmeans.csv")
header = data.first()
parsedData = data.filter(lambda x: x != header).map(
    lambda line: array([float(x) for x in line.split(',')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=100,
                        runs=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
print("--- %s seconds ---" % (time.time() - start_time))
Aggregate Function Benchmarking

Query                                                          Windows     Spark Cluster
0.8M records - Total amount spent on all transactions
per customer                                                   30 secs     7 secs
10M records - Transaction count                                167 secs    8 secs
10M records - Total amount sum                                 106 secs    10 secs
K-Means Clustering Benchmarking

Cluster config                                Windows     Spark Cluster
K=4, iter=100,  Python, rows=1048576          82 secs     25 secs
K=4, iter=1000, Python, rows=1048576          800 secs    31 secs
K=4, iter=1000, Scala,  rows=1048576          -           13 secs
References:
1. https://spark.apache.org/docs/1.3.0/index.html
2. http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
3. https://spark.apache.org/docs/1.3.0/spark-standalone.html
4. https://spark.apache.org/docs/1.3.0/quick-start.html
5. https://spark.apache.org/docs/1.3.0/submitting-applications.html