www.quoininc.com Boston Charlotte New York Washington DC Managua

Big Data with Apache Spark and Amazon AWS

Lance Parlier, 16 February 2017
Transcript
Page 1

www.quoininc.com Boston Charlotte New York Washington DC Managua

Big Data with Apache Spark and Amazon AWS

Lance Parlier, 16 February 2017

Page 2

Big Data with Apache Spark and Amazon AWS

An introduction to big data applications using Apache Spark, running on Amazon AWS EC2 clusters.

This is a short introduction to some big data programs using Apache Spark. These programs will be run both locally and on Amazon AWS EC2 clusters, and their performance compared.

Big Data, Apache Spark, Amazon AWS

16 February 2017 Quoin Inc. 2

Page 3

Definitions

• Big Data: Extremely large data sets that are analyzed computationally to reveal patterns and trends.

• Apache Spark: A fast, in-memory data processing engine. We will be using Spark’s engine for batch processing in the examples.

• Amazon AWS: A secure cloud services platform.

• EC2: Elastic Compute Cloud (EC2). Virtual computers for rent on AWS.

• S3: Simple Storage Service (S3). Storage on AWS.

• Scala: A general-purpose programming language. It is object-oriented (similar to Java) and has full support for functional programming.
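As a small, hypothetical illustration of that mix of object-oriented and functional styles (this snippet is not taken from the slides):

```scala
// Case classes give concise, immutable object-oriented types.
case class Word(text: String)

object ScalaDemo {
  def main(args: Array[String]): Unit = {
    val words = List(Word("big"), Word("data"), Word("spark"))
    // Functional style: transform the collection with map and filter.
    val result = words.map(_.text.toUpperCase).filter(_.length > 3)
    println(result) // prints List(DATA, SPARK)
  }
}
```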

Page 4

Definitions cont.

• Hadoop: An open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

• Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.

Page 5

Why Do We Need Big Data?

Page 6

Apache Spark

Page 7

An Example: Wordcount (local)

Page 8

An Example: Wordcount (local) cont.

• We will first run this example locally, with only one worker thread.

• The initial input size will be 258 megabytes.

*The local machine this was run on has an Intel i5 processor and 8 gigabytes of RAM.
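The slides that follow show screenshots of the code and run, which are not captured in this transcript. A minimal Spark word count in Scala, run with a single local worker thread, typically looks like the sketch below (the input and output paths are placeholders, and the SparkConf/SparkContext RDD API reflects Spark as of early 2017):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[1]" = run locally with a single worker thread.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[1]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")      // placeholder input path
      .flatMap(line => line.split("\\s+"))     // split each line into words
      .map(word => (word, 1))                  // pair each word with a count of 1
      .reduceByKey(_ + _)                      // sum the counts per word

    counts.saveAsTextFile("output")            // placeholder output path
    sc.stop()
  }
}
```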

Pages 9-11: An Example: Wordcount (local) cont.

[Slide content not captured in this transcript.]

Page 12

Wordcount (AWS)

Page 13

Wordcount (AWS) cont.

What we will do:

• Use the spark-ec2 script to create the clusters.

• Upload the input files to S3.

• Upload the jar to the master.

• SSH into the master, and run the jar.
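Sketched as shell commands, those steps might look like the following. The cluster name, key pair, bucket, and file names are placeholders, and the spark-ec2 script (shipped with Spark distributions of that era) is assumed to be on the path:

```
# 1. Launch a cluster (1 master + N slaves) with the spark-ec2 script.
./spark-ec2 --key-pair=my-keypair --identity-file=my-key.pem \
    --slaves=5 --instance-type=t2.micro launch wordcount-cluster

# 2. Upload the input files to S3.
aws s3 cp input.txt s3://my-bucket/input.txt

# 3. Copy the application jar to the master node.
scp -i my-key.pem wordcount.jar root@<master-public-dns>:~

# 4. SSH into the master and submit the job.
ssh -i my-key.pem root@<master-public-dns>
./spark/bin/spark-submit --class WordCount wordcount.jar
```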

Pages 14-17: Wordcount (AWS) cont.

[Slide content not captured in this transcript.]

Page 18

Big Data?

• In these examples, 258 MB isn't really what most would consider big data.

• So, we will ramp the input size up to 3.5 gigabytes and try again.

Page 19

Bigger Data (local)

Page 20

Bigger Data (AWS)

Page 21

Ramping it up

• Since Amazon's t2.micro instances didn't perform the way we wanted, let's ramp things up some.

• Now, we will use six m4.xlarge instances (1 master and 5 slaves).

• These have 16 gigabytes of memory each, compared to the t2.micro's 1 gigabyte per instance.
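With 16 gigabytes per node, much more memory can be handed to each Spark executor. A hypothetical submission showing the idea (the 12G figure and the class and jar names are illustrative assumptions, not from the slides):

```
# Leave headroom for the OS and Spark daemons;
# give each executor most of the m4.xlarge's 16 GB.
./spark/bin/spark-submit \
    --class WordCount \
    --executor-memory 12G \
    wordcount.jar
```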

Page 22

Ramping it up cont.

Page 23

Summarizing Performance

                 LOCAL          AWS (t2.micro)    AWS (m4.xlarge)
SMALL DATASET    27 seconds     56 seconds        22 seconds
LARGE DATASET    254 seconds    243 seconds       150 seconds

Page 24

Summarizing Performance cont.

• Using the smaller instances on AWS, performance was worse or only slightly better. This could be because each t2.micro has only 1 gigabyte of RAM (6 gigabytes across the cluster), versus the 8 gigabytes on the local machine. On top of that, the AWS cluster has the overhead of distributing jobs among the slaves.

• We can see the advantages once we begin to use larger instances and clusters on larger datasets. There is some overhead to distributed computing, but with a large enough dataset or more complex code, it becomes clear why we need it.

Page 25

Sources

• http://www.webopedia.com/TERM/B/big_data.html

• http://spark.apache.org/

• http://www.crackinghadoop.com/apache-spark-101-introduction-for-big-data-newcomers/
