Date posted: 14-Apr-2017
Category: Data & Analytics
Uploaded by: yiguang-hu
DATA ANALYSIS WITH SCALA/SPARK
YIGUANG HU
DOWNLOAD/INSTALL SCALA, SPARK
• DOWNLOAD/INSTALL SCALA FROM https://www.scala-lang.org
• DOWNLOAD/INSTALL SPARK FROM http://spark.apache.org
• SET SCALA_HOME AND SPARK_HOME TO THE INSTALL DIRECTORIES
TEST INSTALL
• Run $SPARK_HOME/bin/spark-shell
• It should bring you to the prompt
• scala>
• Run a few commands, such as
• val lines=sc.parallelize(List("Hello world","hi"))
• lines.count()
• This should print 2
LOAD TEXT FILE
scala> val kjv=sc.textFile("kjv.txt")
kjv: org.apache.spark.rdd.RDD[String] = kjv.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> kjv.count()
res0: Long = 31143
Now the Bible text is loaded into kjv, and count() reports 31143 lines.
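Because kjv is reused by every query that follows, it may be worth caching. The sketch below is only an illustration: it writes a hypothetical three-line sample file in place of kjv.txt, and the `local[*]` master and app name are assumptions for running outside spark-shell (inside the shell, `getOrCreate` simply returns the existing context).

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("load-sketch"))

// Hypothetical three-line sample standing in for kjv.txt.
val samplePath = Files.createTempFile("sample", ".txt")
Files.write(samplePath, "first line\nsecond line\nthird line\n".getBytes("UTF-8"))

// textFile is lazy: nothing is read until an action such as count() runs.
// cache() asks Spark to keep the lines in memory after the first action.
val sample = sc.textFile(samplePath.toUri.toString).cache()

val n = sample.count()      // reads the file and fills the cache
val again = sample.count()  // can be served from the cache
println(n)
```

With the real file, the same pattern would be `sc.textFile("kjv.txt").cache()`.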
SEARCH
• How many verses contain the word “God”?
scala> val god=kjv.filter(x=>x.contains("God"))
god: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:26
scala> god.count()
res2: Long = 3585
• How many verses contain the word “Love”/“love”?
scala> val love=kjv.filter(x=> x.contains("Love")||x.contains("love"))
love: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:26
scala> love.count()
res3: Long = 546
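Note that `contains` is a substring test, so a search for "love" also matches words such as "beloved" or "loveth". The following is a minimal sketch of a whole-word, case-insensitive alternative. The four sample lines are made up for illustration, and the `local[*]` master and app name are assumptions for running outside spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("search-sketch"))

// Hypothetical sample lines, not taken from kjv.txt.
val verses = sc.parallelize(List(
  "God is love",
  "my beloved son",
  "Love one another",
  "he loveth righteousness"))

// Substring match, as on the slide: also hits "beloved" and "loveth".
val substring = verses.filter(v => v.contains("Love") || v.contains("love"))

// Whole-word match: lower-case the line, split on non-word characters,
// and look for the exact token "love".
val wholeWord = verses.filter(v => v.toLowerCase.split("\\W+").contains("love"))

val substringCount = substring.count()
val wholeWordCount = wholeWord.count()
println(s"substring: $substringCount, whole word: $wholeWordCount")
```

Which definition is appropriate depends on whether inflected forms such as "loveth" should count as a use of the word.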
SEARCH
How many verses contain God or Love?
scala> val god_or_love=god.union(love)
god_or_love: org.apache.spark.rdd.RDD[String] = UnionRDD[5] at union at <console>:30
scala> god_or_love.count()
res4: Long = 4131
Note: union keeps duplicates, so 4131 is simply 3585 + 546, and a verse that contains both words is counted twice. Calling god.union(love).distinct() removes the repeats.
How many verses contain both God and Love?
scala> val god_and_love=god.intersection(love)
god_and_love: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at intersection at <console>:30
scala> god_and_love.count()
res5: Long = 100
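The difference between `union`, `union` followed by `distinct`, and `intersection` can be seen on a tiny data set. This is a sketch on four made-up lines, not on kjv.txt. The `local[*]` master and app name are assumptions for running outside spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("set-ops-sketch"))

// Hypothetical sample lines; exactly one of them contains both words.
val verses = sc.parallelize(List(
  "God is love",
  "In the beginning God created",
  "love one another",
  "a time to weep"))

val god = verses.filter(_.contains("God"))
val love = verses.filter(v => v.contains("Love") || v.contains("love"))

// union concatenates the two RDDs, so "God is love" appears twice.
val unionCount = god.union(love).count()

// distinct() drops the repeated line.
val distinctCount = god.union(love).distinct().count()

// A single filter with || gives the same answer without a shuffle.
val singleFilterCount = verses.filter(v =>
  v.contains("God") || v.contains("Love") || v.contains("love")).count()

// intersection returns the lines present in both RDDs, without duplicates.
val bothCount = god.intersection(love).count()

println(s"$unionCount $distinctCount $singleFilterCount $bothCount")
```

Here the plain union reports 4 while only 3 lines actually contain either word.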
STATISTICS
• How many words are used in the Bible?
scala> val words = kjv.flatMap(line=>line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:26
scala> words.count()
res6: Long = 820867
• How many unique words are used in the Bible?
scala> val distinctword=words.distinct()
distinctword: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at distinct at <console>:28
scala> distinctword.count()
res7: Long = 60023
• How many times is each word used in the Bible? (count_by_word is not defined on the slides; the .toList call suggests a local map, presumably the result of words.countByValue().)
scala> count_by_word.toList.sortBy(_._2).reverse.take(10).foreach (println)
(the,62050)
(and,38571)
(of,34393)
(to,13363)
(And,12739)
(that,12454)
(in,12167)
(shall,9759)
(he,9509)
(unto,8931)
Splitting on a single space keeps case and punctuation, which is why "and" and "And" are counted separately.
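Since the slides do not show how count_by_word was built, here is one possible way to get per-word counts, sketched on two made-up lines. Using `countByValue` is an assumption about the original, and the `reduceByKey` variant is an alternative that keeps the counting on the cluster. The `local[*]` master and app name are assumptions for running outside spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("wordcount-sketch"))

// Hypothetical sample lines, not taken from kjv.txt.
val lines = sc.parallelize(List("the light and the dark", "and the evening"))
val words = lines.flatMap(_.split(" "))

// Option 1 (assumed to match the slides): countByValue returns a local
// Map[String, Long] on the driver, which is then sorted in plain Scala.
val countByWord = words.countByValue()
val top = countByWord.toList.sortBy(-_._2).take(2)

// Option 2: count and sort as RDD operations, then fetch only the top rows.
val topDistributed = words
  .map(w => (w, 1L))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(2)

top.foreach(println)
```

Option 1 is fine when the vocabulary fits in driver memory, as it does here. Option 2 avoids collecting every distinct word.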
STATISTICS
• How many verses are in Genesis?
scala> val ge=kjv.filter(line=>line.startsWith("Ge"))
ge: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[22] at filter at <console>:26
scala> ge.count()
res24: Long = 1534
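The same idea extends to counting the verses of every book in one pass. The sketch below assumes a hypothetical line layout of `<book abbreviation> <chapter>:<verse> <text>`. The actual layout of kjv.txt is not shown on the slides, so the key extraction would need adjusting to the real file. The `local[*]` master and app name are assumptions for running outside spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("per-book-sketch"))

// Hypothetical sample lines in an assumed "<book> <chapter>:<verse> <text>" layout.
val sample = sc.parallelize(List(
  "Ge 1:1 sample verse text",
  "Ge 1:2 more sample text",
  "Ex 1:1 another sample line"))

// Use the first space-separated token as the book key and count per key.
val versesPerBook = sample.map(_.split(" ")(0)).countByValue()

versesPerBook.foreach { case (book, n) => println(s"$book: $n") }
```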