
Data analysis scala_spark

Date post: 14-Apr-2017
Upload: yiguang-hu
DATA ANALYSIS WITH

SCALA/SPARK

YIGUANG HU

DOWNLOAD/INSTALL SCALA, SPARK

• DOWNLOAD/INSTALL SCALA FROM https://www.scala-lang.org

• DOWNLOAD/INSTALL SPARK FROM http://spark.apache.org

• SET SCALA_HOME AND SPARK_HOME BASED ON THE INSTALL LOCATIONS

TEST INSTALL

• run $SPARK_HOME/bin/spark-shell

• this should bring you to the prompt

• scala>

• run a few commands, such as

• val lines=sc.parallelize(List("Hello world","hi"))

• lines.count()

• should print 2

LOAD TEXT FILE

scala> val kjv=sc.textFile("kjv.txt")
kjv: org.apache.spark.rdd.RDD[String] = kjv.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> kjv.count()
res0: Long = 31143

The Bible text is now available as the RDD kjv, with one element per line of the file.
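To make the behaviour of sc.textFile a little more concrete, here is a minimal sketch that writes a few made-up lines to a temporary file and reads them back. The sample text and the names path, sample, n and head are assumptions for illustration only; kjv.txt itself is not needed. The snippet is meant to be pasted into spark-shell (where getOrCreate simply returns the running session) or run as a script with Spark on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the session that is already running.
val spark = SparkSession.builder().master("local[*]").appName("load-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data standing in for kjv.txt.
val path = Files.createTempFile("sample", ".txt")
Files.write(path, "first line\nsecond line\nthird line\n".getBytes("UTF-8"))

// textFile is lazy: the file is only read when an action such as count() runs.
val sample = sc.textFile(path.toString)

val n = sample.count()      // number of lines in the file
val head = sample.first()   // first line, useful for checking the file format
println(s"$n lines, first: $head")
```

Calling first() or take(5) on kjv in the same way is a quick check of how each line of the real file is laid out before filtering on it.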

SEARCH

• How many verses contain the word “God”?

scala> val god=kjv.filter(x=>x.contains("God"))
god: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:26

scala> god.count()
res2: Long = 3585

• How many verses contain the word “Love”/“love”?

scala> val love=kjv.filter(x=> x.contains("Love")||x.contains("love"))
love: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:26

scala> love.count()
res3: Long = 546
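Note that contains is a substring test, so the filters above also count lines where the letters appear inside a longer word. The sketch below contrasts that with a whole-word, case-insensitive regex match. The sample lines and the names verses, bySubstring, loveWord and byWord are made up for illustration and are not taken from kjv.txt.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the session that is already running.
val spark = SparkSession.builder().master("local[*]").appName("search-sketch").getOrCreate()
val sc = spark.sparkContext

// Made-up sample lines, not taken from kjv.txt.
val verses = sc.parallelize(Seq(
  "God is love",
  "Love one another",
  "his beloved son",
  "a pair of gloves"
))

// Substring match, as on the slide: also hits "beloved" and "gloves".
val bySubstring = verses.filter(x => x.contains("Love") || x.contains("love"))

// Whole-word, case-insensitive match using word boundaries.
val loveWord = "(?i)\\blove\\b".r
val byWord = verses.filter(x => loveWord.findFirstIn(x).isDefined)

println(bySubstring.count())  // 4
println(byWord.count())       // 2
```

Which of the two is the right count depends on the question being asked; the substring version is the one the slide uses.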

SEARCH

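How many verses contain God or love?

scala> val god_or_love=god.union(love)
god_or_love: org.apache.spark.rdd.RDD[String] = UnionRDD[5] at union at <console>:30

scala> god_or_love.count()
res4: Long = 4131

Note: union keeps duplicates, so a verse that matches both filters is counted twice (3585 + 546 = 4131). Subtracting the 100 verses found by the intersection below gives 4031 verses, assuming no line is repeated in the file; god.union(love).distinct().count() computes this directly.

How many verses contain both God and love?

scala> val god_and_love=god.intersection(love)
god_and_love: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at intersection at <console>:30

scala> god_and_love.count()
res5: Long = 100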
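The difference between union, union followed by distinct, and intersection is easy to see on a tiny data set. The following is a sketch on made-up lines (the names verses, unionCount, distinctCount, singleFilter and both are illustrative, not from the slides):

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the session that is already running.
val spark = SparkSession.builder().master("local[*]").appName("union-sketch").getOrCreate()
val sc = spark.sparkContext

// Made-up sample lines; the first one matches both filters.
val verses = sc.parallelize(Seq(
  "God is love",
  "In the beginning God",
  "love one another",
  "Jesus wept"
))

val god = verses.filter(_.contains("God"))
val love = verses.filter(x => x.contains("Love") || x.contains("love"))

// union concatenates the two RDDs, so "God is love" appears twice.
val unionCount = god.union(love).count()

// distinct() removes the duplicate before counting.
val distinctCount = god.union(love).distinct().count()

// A single filter with a combined predicate avoids the duplicate altogether.
val singleFilter = verses.filter(x =>
  x.contains("God") || x.contains("Love") || x.contains("love")).count()

// intersection returns each common element once.
val both = god.intersection(love).count()

println(s"union=$unionCount distinct=$distinctCount filter=$singleFilter both=$both")
```

On this sample the union count is 4 while the distinct and single-filter counts are both 3, and the gap equals the intersection count of 1.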

STATISTICS

• How many words are used in the Bible?

scala> val words = kjv.flatMap(line=>line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:26

scala> words.count()
res6: Long = 820867

• How many unique words are used in the Bible?

scala> val distinctword=words.distinct()
distinctword: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at distinct at <console>:28

scala> distinctword.count()
res7: Long = 60023

• How many times is each word used in the Bible? (The slide does not show how count_by_word is defined; words.countByValue() is one way to build such a word-to-count map.)

scala> count_by_word.toList.sortBy(_._2).reverse.take(10).foreach(println)
(the,62050)
(and,38571)
(of,34393)
(to,13363)
(And,12739)
(that,12454)
(in,12167)
(shall,9759)
(he,9509)
(unto,8931)
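Because the text is split on single spaces only, the counts above treat "and" and "And" as different words, and punctuation stays attached to tokens. A possible variant, sketched here on two made-up sample lines, lower-cases the text, splits on any run of non-letters, and counts with reduceByKey so the tally stays distributed until the final take. The names lines, counts and top are illustrative; this is not the slide author's method.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the session that is already running.
val spark = SparkSession.builder().master("local[*]").appName("wordcount-sketch").getOrCreate()
val sc = spark.sparkContext

// Made-up sample lines standing in for kjv.
val lines = sc.parallelize(Seq(
  "And God said, Let there be light: and there was light.",
  "And God saw the light, that it was good"
))

// Lower-case, split on runs of non-letters, and drop empty tokens.
val words = lines.flatMap(_.toLowerCase.split("[^a-z]+")).filter(_.nonEmpty)

// Classic word count: pair each word with 1 and sum per key.
val counts = words.map(w => (w, 1L)).reduceByKey(_ + _)

// Sort by count descending, breaking ties alphabetically, and keep the top 3.
val top = counts.sortBy(p => (-p._2, p._1)).take(3)
top.foreach(println)
```

countByValue returns the whole map to the driver, which is fine for a vocabulary of this size; reduceByKey followed by take only brings back the rows that are printed.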

STATISTICS

• How many verses are in Genesis?

scala> val ge=kjv.filter(line=>line.startsWith("Ge"))
ge: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[22] at filter at <console>:26

scala> ge.count()
res24: Long = 1534
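The same prefix idea extends to counting lines for every book at once. The sketch below assumes, purely for illustration, that each line starts with a book abbreviation followed by a space (for example "Ge 1:1 ..."); the actual layout of kjv.txt is not shown on the slides, so the sample lines and the names lines and perBook are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the session that is already running.
val spark = SparkSession.builder().master("local[*]").appName("perbook-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical line format: "<book abbreviation> <chapter>:<verse> <text>".
val lines = sc.parallelize(Seq(
  "Ge 1:1 In the beginning God created the heaven and the earth.",
  "Ge 1:2 And the earth was without form, and void;",
  "Ex 1:1 Now these are the names of the children of Israel,"
))

// Take everything before the first space as the book key, then count per key.
val perBook = lines.map(_.takeWhile(_ != ' ')).countByValue()

perBook.foreach { case (book, n) => println(s"$book: $n") }
```

If the real file uses a different prefix layout, only the key-extraction step in the map would need to change.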
