
[OracleCode SF] In memory analytics with apache spark and hazelcast

Date post: 12-Apr-2017
Upload: viktor-gamov
@gamussa @hazelcast #oraclecode

IN-MEMORY ANALYTICS with APACHE SPARK and HAZELCAST
Transcript


Solutions Architect, Developer Advocate

@gamussa in internetz

Please, follow me on Twitter. I’m very interesting ☺

Who am I?


What’s Apache Spark?

Lightning-Fast Cluster Computing


Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.


When to use Spark?

Data Science Tasks: when questions are unknown

Data Processing Tasks: when you have too much data

You’re tired of Hadoop


Spark Architecture


RDD


Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.


RDD Operations


Two kinds of operations on RDDs: transformations and actions.


Transformations are lazy (not computed immediately).

A transformed RDD is recomputed each time an action is run on it (by default).
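A minimal sketch of this laziness in Java (assuming Spark’s local mode; the class and variable names are illustrative):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyRddDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("lazy-rdd-demo");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformation: only recorded in the lineage, nothing runs yet
            JavaRDD<Integer> squares = numbers.map(x -> x * x);

            // Actions: each one triggers an actual computation of the lineage
            long count = squares.count();
            int sum = squares.reduce(Integer::sum);

            // Without cache()/persist(), 'squares' was recomputed for each action
            System.out.println(count + " " + sum);
        }
    }
}
```

Running this prints `5 55`; calling `squares.cache()` before the actions would avoid the recomputation between them.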


RDD Transformations


RDD Actions


RDD Fault Tolerance


RDD Construction


parallelized collections

take an existing Scala collection and run functions on it in parallel
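A sketch in Java rather than Scala (the partition count of 3 and the values are arbitrary):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("parallelize-demo");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // distribute an existing in-memory collection across 3 partitions
            JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(10, 20, 30, 40), 3);

            // the functions below run on the partitions in parallel
            System.out.println(rdd.getNumPartitions()); // 3
            System.out.println(rdd.reduce(Integer::sum)); // 100
        }
    }
}
```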


Hadoop datasets

run functions on each record of a file in the Hadoop distributed file system, or any other storage system supported by Hadoop
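A sketch using a local file in place of HDFS (an `hdfs://...` URI works the same way when Hadoop is available; the file contents here are made up):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFileDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("records", ".txt");
        Files.write(tmp, Arrays.asList("spark", "hazelcast", "hadoop"));

        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("textfile-demo");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // each line of the file becomes one record of the RDD
            JavaRDD<String> lines = jsc.textFile(tmp.toString());
            long startsWithH = lines.filter(l -> l.startsWith("h")).count();
            System.out.println(lines.count() + " " + startsWithH); // 3 2
        }
    }
}
```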


What’s Hazelcast IMDG?

The Fastest In-memory Data Grid


Hazelcast IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage and performs parallel execution for breakthrough application speed and scale.


High-Density Caching

In-Memory Data Grid

Web Session Clustering

Microservices Infrastructure


What’s Hazelcast IMDG?

In-memory Data Grid

Apache v2 Licensed

Distributed Caches (IMap, JCache)

Java Collections (IList, ISet, IQueue)

Messaging (Topic, RingBuffer)

Computation (ExecutorService, M-R)
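As a minimal sketch of the distributed-map API from that list (using the Hazelcast 3.x packages current at the time of this talk; an embedded single-member cluster stands in for a real one, and the entries are made up):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ImapDemo {
    public static void main(String[] args) {
        // start an embedded member; more members with the same group
        // config would discover each other and form a cluster
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        // IMap is a distributed, partitioned java.util.Map
        IMap<String, Integer> movies = hz.getMap("movie");
        movies.put("The Matrix", 1999);
        movies.put("Inception", 2010);

        System.out.println(movies.size() + " " + movies.get("Inception")); // 2 2010
        Hazelcast.shutdownAll();
    }
}
```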


[Diagram: map entries partitioned across the cluster; each shard has a primary and a backup copy]


final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");

final JavaSparkContext jsc = new JavaSparkContext("spark://localhost:7077", "app", sparkConf);

final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
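Writing back to the grid goes through the connector’s pair-RDD decorator. A hedged sketch continuing the snippet above (`jsc` is the context configured there; `saveToHazelcastMap` is the method name documented in the hazelcast-spark README, and the pairs and target map name are illustrative):

```java
import static com.hazelcast.spark.connector.HazelcastJavaPairRDDFunctions.javaPairRddFunctions;

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

// build some (key, value) pairs on the Spark side
JavaPairRDD<String, Integer> years = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("The Matrix", 1999),
        new Tuple2<>("Inception", 2010)));

// push the pairs into the "movie" IMap on the connected Hazelcast cluster
javaPairRddFunctions(years).saveToHazelcastMap("movie");
```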


Demo


LIMITATIONS


DATA SHOULD NOT BE UPDATED WHILE READING FROM SPARK


WHY?


MAP EXPANSION SHUFFLES THE DATA INSIDE THE BUCKET


THE CURSOR DOESN’T POINT TO THE CORRECT ENTRY ANYMORE; DUPLICATE OR MISSING ENTRIES COULD OCCUR


github.com/hazelcast/hazelcast-spark


THANKS! Any questions? You can find me at @gamussa / [email protected]

