
[OracleCode SF] In memory analytics with apache spark and hazelcast

Date post: 12-Apr-2017
Upload: viktor-gamov
@gamussa @hazelcast #oraclecode

IN-MEMORY ANALYTICS with APACHE SPARK and HAZELCAST
Transcript


Solutions Architect, Developer Advocate

@gamussa in internetz

Please, follow me on Twitter. I’m very interesting ☺

Who am I?


What’s Apache Spark?

Lightning-Fast Cluster Computing


Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.


When to use Spark?

Data Science Tasks: when questions are unknown

Data Processing Tasks: when you have too much data

You’re tired of Hadoop


Spark Architecture


RDD


Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.


RDD Operations


Two kinds of operations on RDDs: transformations and actions.


Transformations are lazy (not computed immediately).

A transformed RDD is recomputed each time an action is run on it (by default).
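A minimal sketch of this laziness in Java (assuming Spark’s local mode; the class and variable names are illustrative):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyRddDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("lazy-rdd-demo");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformation: only recorded in the lineage, nothing runs yet
            JavaRDD<Integer> squares = numbers.map(x -> x * x);

            // Actions: each one triggers an actual computation of the lineage
            long count = squares.count();
            int sum = squares.reduce(Integer::sum);

            // Without cache()/persist(), 'squares' was recomputed for each action
            System.out.println(count + " " + sum);
        }
    }
}
```

Running this prints `5 55`; calling `squares.cache()` before the actions would avoid the recomputation between them.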


RDD Transformations


RDD Actions


RDD Fault Tolerance


RDD Construction


parallelized collections

take an existing Scala collection and run functions on it in parallel
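A sketch in Java rather than Scala (the partition count of 3 and the values are arbitrary):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("parallelize-demo");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // distribute an existing in-memory collection across 3 partitions
            JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(10, 20, 30, 40), 3);

            // the functions below run on the partitions in parallel
            System.out.println(rdd.getNumPartitions()); // 3
            System.out.println(rdd.reduce(Integer::sum)); // 100
        }
    }
}
```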


Hadoop datasets

run functions on each record of a file in the Hadoop distributed file system, or any other storage system supported by Hadoop
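A sketch using a local file in place of HDFS (an `hdfs://...` URI works the same way when Hadoop is available; the file contents here are made up):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFileDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("records", ".txt");
        Files.write(tmp, Arrays.asList("spark", "hazelcast", "hadoop"));

        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("textfile-demo");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // each line of the file becomes one record of the RDD
            JavaRDD<String> lines = jsc.textFile(tmp.toString());
            long startsWithH = lines.filter(l -> l.startsWith("h")).count();
            System.out.println(lines.count() + " " + startsWithH); // 3 2
        }
    }
}
```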


What’s Hazelcast IMDG?

The Fastest In-memory Data Grid


Hazelcast IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage and performs parallel execution for breakthrough application speed and scale.


High-Density Caching

In-Memory Data Grid

Web Session Clustering

Microservices Infrastructure


What’s Hazelcast IMDG?

In-memory Data Grid

Apache v2 Licensed

Distributed Caches (IMap, JCache)

Java Collections (IList, ISet, IQueue)

Messaging (Topic, RingBuffer)

Computation (ExecutorService, M-R)
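As a minimal sketch of the distributed-map API from that list (using the Hazelcast 3.x packages current at the time of this talk; an embedded single-member cluster stands in for a real one, and the entries are made up):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ImapDemo {
    public static void main(String[] args) {
        // start an embedded member; more members with the same group
        // config would discover each other and form a cluster
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        // IMap is a distributed, partitioned java.util.Map
        IMap<String, Integer> movies = hz.getMap("movie");
        movies.put("The Matrix", 1999);
        movies.put("Inception", 2010);

        System.out.println(movies.size() + " " + movies.get("Inception")); // 2 2010
        Hazelcast.shutdownAll();
    }
}
```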


[Diagram: map entries partitioned across the cluster; each shard has a primary and a backup copy]


final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");

final JavaSparkContext jsc = new JavaSparkContext("spark://localhost:7077", "app", sparkConf);

final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
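Writing back to the grid goes through the connector’s pair-RDD decorator. A hedged sketch continuing the snippet above (`jsc` is the context configured there; `saveToHazelcastMap` is the method name documented in the hazelcast-spark README, and the pairs and target map name are illustrative):

```java
import static com.hazelcast.spark.connector.HazelcastJavaPairRDDFunctions.javaPairRddFunctions;

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

// build some (key, value) pairs on the Spark side
JavaPairRDD<String, Integer> years = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("The Matrix", 1999),
        new Tuple2<>("Inception", 2010)));

// push the pairs into the "movie" IMap on the connected Hazelcast cluster
javaPairRddFunctions(years).saveToHazelcastMap("movie");
```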


Demo


LIMITATIONS


DATA SHOULD NOT BE UPDATED WHILE READING FROM SPARK


WHY?


MAP EXPANSION SHUFFLES THE DATA INSIDE THE BUCKET


THE CURSOR DOESN’T POINT TO THE CORRECT ENTRY ANYMORE; DUPLICATE OR MISSING ENTRIES COULD OCCUR


github.com/hazelcast/hazelcast-spark


THANKS! Any questions? You can find me at @gamussa / [email protected]

