Building a Stock Prediction system with Machine Learning using Geode, Spring XD and Spark MLlib

William Markito (@william_markit)

Fred Melo (@fredmelo_br)

It's all about DATA

Data Sources

Look for patterns

Prediction

Inputs: price(x), moving average (x), relative strength

Machine Learning Model (e.g. Linear Regression)

Output: moving average (x+1)
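
The inputs on this slide (current price, an average over recent prices, relative strength) and the predicted next-period average can be sketched as follows. This is a minimal sketch, not the talk's implementation: the helper names, the window size, the sample prices, the RSI-style formula and the plain least-squares solver are all assumptions chosen for illustration.

```python
def moving_average(prices, window):
    """Simple moving average; result[k] covers prices[k .. k + window - 1]."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def relative_strength(prices, window):
    """RSI-style value in [0, 100] over the last `window` price changes."""
    out = []
    for i in range(window, len(prices)):
        changes = [prices[j] - prices[j - 1] for j in range(i - window + 1, i + 1)]
        gains = sum(c for c in changes if c > 0)
        losses = -sum(c for c in changes if c < 0)
        out.append(100.0 if losses == 0 else 100.0 - 100.0 / (1.0 + gains / losses))
    return out

def fit_linear(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Augmented system [X^T X | X^T y], solved by Gauss-Jordan elimination.
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, targets))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(weights, row):
    """Apply the fitted weights (intercept first) to one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))

# Hypothetical sample prices, for illustration only.
prices = [10.0, 10.5, 10.2, 10.8, 11.1, 10.9, 11.4, 11.8, 11.6, 12.0, 12.3, 12.1]
window = 3
ma = moving_average(prices, window)     # ma[k] belongs to prices[k + window - 1]
rs = relative_strength(prices, window)  # rs[k] belongs to prices[k + window]

# Features at time x: price, moving average, relative strength.
# Target: the moving average at time x + 1.
rows, targets = [], []
for i in range(window, len(prices) - 1):
    rows.append((prices[i], ma[i - window + 1], rs[i - window]))
    targets.append(ma[i - window + 2])

weights = fit_linear(rows, targets)
next_ma = predict(weights, (prices[-1], ma[-1], rs[-1]))
```

In a distributed setup the fitting step would be handed to a library such as Spark MLlib; the hand-written solver here only keeps the sketch self-contained.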

[Architecture diagram. Labels: Real data, Simulator; Spring XD (Transform, Enrich, Filter, Sink); Machine Learning; Indicators; Predict; Dashboard; /Stocks, /TechIndicators, /Predictions]

Extensible, Open-Source, Fault-Tolerant, Horizontally Scalable, Cloud-Native

Apache Geode (incubating)

Introduction

A distributed, memory-based data management platform for data-oriented apps that need:
• High performance, scalability, resiliency and continuous availability
• Fast access to critical data sets
• Location-aware distributed data processing
• Event-driven data architecture

Concepts

Cache
• In-memory storage and management for your data
• Configurable through XML, Spring, Java API or CLI
• Collection of Regions

Concepts

Region
• Distributed java.util.Map on steroids (Key/Value)
• Consistent API regardless of where or how data is stored
• Observable (reactive)
• Highly available, redundant on cache Member(s)
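
The "observable key/value map" idea can be illustrated with a toy, in-process stand-in. This sketch is not Geode's API: `ToyRegion`, its methods, the event names and the sample ticker are hypothetical, and there is no distribution or redundancy here, only the map-plus-notification behaviour.

```python
class ToyRegion:
    """Hypothetical stand-in for a region: a key/value map that notifies
    registered listeners after every change."""

    def __init__(self, name):
        self.name = name
        self._entries = {}
        self._listeners = []

    def add_listener(self, callback):
        # callback(event_type, key, old_value, new_value)
        self._listeners.append(callback)

    def put(self, key, value):
        existed = key in self._entries
        old = self._entries.get(key)
        self._entries[key] = value
        self._notify("update" if existed else "create", key, old, value)

    def get(self, key):
        return self._entries.get(key)

    def remove(self, key):
        if key in self._entries:
            old = self._entries.pop(key)
            self._notify("destroy", key, old, None)

    def _notify(self, event_type, key, old, new):
        for callback in self._listeners:
            callback(event_type, key, old, new)


stocks = ToyRegion("/Stocks")
seen = []
stocks.add_listener(lambda *event: seen.append(event))
stocks.put("AAPL", 101.5)   # create
stocks.put("AAPL", 102.0)   # update
stocks.remove("AAPL")       # destroy
```

Callers use the same `put`/`get` calls whether or not anyone is listening, which is the point of the "consistent API" and "observable" bullets.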

Concepts

Region
• Local, Replicated or Partitioned
• In-memory or persistent
• Redundant
• LRU
• Overflow

LOCAL, LOCAL_HEAP_LRU, LOCAL_OVERFLOW, LOCAL_PERSISTENT, LOCAL_PERSISTENT_OVERFLOW

PARTITION, PARTITION_HEAP_LRU, PARTITION_OVERFLOW, PARTITION_PERSISTENT, PARTITION_PERSISTENT_OVERFLOW, PARTITION_PROXY, PARTITION_PROXY_REDUNDANT, PARTITION_REDUNDANT, PARTITION_REDUNDANT_HEAP_LRU, PARTITION_REDUNDANT_OVERFLOW, PARTITION_REDUNDANT_PERSISTENT, PARTITION_REDUNDANT_PERSISTENT_OVERFLOW

REPLICATE, REPLICATE_HEAP_LRU, REPLICATE_OVERFLOW, REPLICATE_PERSISTENT, REPLICATE_PERSISTENT_OVERFLOW, REPLICATE_PROXY
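
The difference between partitioned (optionally redundant) and replicated placement can be sketched with a toy placement function. Everything here is an illustrative assumption: the function names, the crc32 hash, the round-robin choice of backups, the bucket count of 113 and the member names. Geode's real bucket assignment and rebalancing are more involved.

```python
import zlib

def bucket_for(key, total_buckets=113):
    """Map a key to a bucket with a stable hash (crc32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % total_buckets

def partition_owners(key, members, redundant_copies=1, total_buckets=113):
    """Primary member first, followed by `redundant_copies` distinct backups."""
    if redundant_copies >= len(members):
        raise ValueError("need more members than redundant copies")
    start = bucket_for(key, total_buckets) % len(members)
    return [members[(start + i) % len(members)]
            for i in range(redundant_copies + 1)]

def replicate_owners(key, members):
    """A replicated region keeps every entry on every member."""
    return list(members)

members = ["server-1", "server-2", "server-3"]   # hypothetical member names
owners = partition_owners("AAPL", members, redundant_copies=1)
everywhere = replicate_owners("AAPL", members)
```

With one redundant copy, each entry lives on two of the three members, so losing a single member does not lose data; the replicated variant trades memory for local reads on every member.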

Concepts

Member
• A process that has a connection to the system
• A process that has created a cache
• Embeddable within your application

Client

Locator

Server

Concepts

Client cache
• A process connected to the Geode server(s)
• Can have a local copy of the data
• Can be notified about events on the servers

Concepts

Listeners
• CacheWriter / CacheListener
• AsyncEventListener (queue / batch)
• Parallel or Serial
• Conflation
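
Queueing, batching and conflation can be pictured with a toy event queue. This is a sketch under stated assumptions, not Geode's AsyncEventListener API: `ToyAsyncQueue` and its parameters are hypothetical, delivery here is synchronous and triggered only by batch size, whereas a real queue dispatches asynchronously.

```python
class ToyAsyncQueue:
    """Hypothetical event queue that hands events to a listener in batches.
    With conflation on, only the latest value per key is kept in a batch."""

    def __init__(self, listener, batch_size=3, conflate=True):
        self.listener = listener
        self.batch_size = batch_size
        self.conflate = conflate
        self._pending = {}   # dicts preserve insertion order (Python 3.7+)
        self._seq = 0

    def offer(self, key, value):
        self._seq += 1
        # Conflating: one slot per key. Otherwise: one slot per event.
        slot = key if self.conflate else (key, self._seq)
        self._pending[slot] = (key, value)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self._pending:
            batch = list(self._pending.values())
            self._pending = {}
            self.listener(batch)


delivered = []
queue = ToyAsyncQueue(delivered.append, batch_size=2, conflate=True)
queue.offer("AAPL", 101.5)
queue.offer("AAPL", 102.0)   # replaces the pending AAPL event
queue.offer("GOOG", 540.0)   # second distinct key fills the batch
```

Conflation suits consumers that only care about the latest state per key, such as a dashboard; a consumer that needs every intermediate tick would turn it off.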

Currently under incubation at the Apache Software Foundation

Contributions and contributors are welcome:
• Code and patches
• Bugs, feature requests
• Documentation and content
• Any form of feedback

Code
• New features
• Bug fixes (patches)
• Writing tests

Documentation
• Wiki
• Web site
• User guides

Community
• Join our mailing lists (ask or answer)
• Become a speaker
• Find and report bugs
• Test a release candidate or beta

JIRA - https://issues.apache.org/jira/browse/GEODE

GitHub - https://github.com/apache/incubator-geode

Mailing lists:
• Development - dev@geode.incubator.apache.org
• Users - user@geode.incubator.apache.org

Wiki - cwiki.apache.org/confluence/display/GEODE

StackOverflow - http://stackoverflow.com/questions/tagged/geode+or+gemfire

Spring XD

Introduction

Concepts

A stream is composed of modules. Each module is deployed to a container, and its channels are bound to the transport.
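
The source-to-sink chaining of modules can be mimicked in-process with generators. This is only an analogy under stated assumptions: the function names and sample ticks are hypothetical, and where Spring XD deploys each module to a container and connects them over a transport, this sketch simply passes one generator to the next.

```python
def source(ticks):
    """Source module: emits raw ticks one at a time."""
    for tick in ticks:
        yield tick

def filter_module(predicate):
    """Processor module: passes on only the items matching `predicate`."""
    return lambda stream: (item for item in stream if predicate(item))

def transform_module(fn):
    """Processor module: maps every item through `fn`."""
    return lambda stream: (fn(item) for item in stream)

def run_stream(src, processors, sink):
    """Chain source -> processors -> sink, like modules joined by pipes."""
    stream = src
    for processor in processors:
        stream = processor(stream)
    for item in stream:
        sink(item)

# Hypothetical sample ticks, for illustration only.
ticks = [
    {"symbol": "AAPL", "price": 101.5},
    {"symbol": "AAPL", "price": None},     # incomplete tick, filtered out
    {"symbol": "AAPL", "price": 102.0},
]
stored = []
run_stream(
    source(ticks),
    [filter_module(lambda t: t["price"] is not None),
     transform_module(lambda t: (t["symbol"], round(t["price"] * 100)))],
    stored.append,   # sink module: here, just collect into a list
)
```

Because each module only sees its input stream, any stage can be swapped (a simulator instead of real data, a different sink) without touching the others.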

Apache Zeppelin (incubating)

Introduction

Concepts

• Web based REPL
• Iterative & Exploratory
• Support for Data Ingestion

Concepts

Multi interpreters
• Markdown
• Shell
• Spark
• Geode
• Python
• …

Concepts

Sharing through URLs without Reports

Apache Spark

Introduction

Concepts

• RDD
• Dataframe
• Driver
• Worker

"An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes."

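The two properties in the RDD definition, immutability and partitioning, can be shown with a toy collection in plain Python. This is not Spark's API: `ToyRDD` and its methods are hypothetical, and the partitions here live in one process rather than on different cluster nodes.

```python
class ToyRDD:
    """Hypothetical immutable collection split into partitions.
    Transformations return a new ToyRDD and never modify the original."""

    def __init__(self, partitions):
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, items, num_partitions):
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))   # ceiling division
        return cls(items[i:i + size] for i in range(0, len(items), size))

    @property
    def partitions(self):
        return self._partitions

    def map(self, fn):
        # Each partition is processed independently of the others, which is
        # what lets a real engine compute them on different nodes.
        return ToyRDD(tuple(fn(x) for x in part) for part in self._partitions)

    def filter(self, predicate):
        return ToyRDD(tuple(x for x in part if predicate(x))
                      for part in self._partitions)

    def collect(self):
        return [x for part in self._partitions for x in part]


prices = ToyRDD.parallelize(range(1, 7), num_partitions=3)
doubled = prices.map(lambda x: x * 2)
large = doubled.filter(lambda x: x > 6)
```

`prices` is unchanged after `map` and `filter`; every transformation produces a new collection, so earlier results can be reused or recomputed partition by partition.
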
Concepts

• RDD
• Dataframe
• Driver
• Worker

"A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (pandas), previously known as SchemaRDD."

Concepts

• RDD
• Dataframe
• Driver
• Worker

Summary

• Integration
  • Spark, JDBC, Geode
  • HDFS, Twitter, File, Mail…
• Data pipeline orchestration
• Intuitive DSL
• Streaming & Analytics
• Distributed and scalable

• Web based REPL
• Multiple Interpreters
  • Apache Spark
  • Markdown
  • Flink
  • Python
  • Geode…
• Iterative & Exploratory

Summary

• Fast data processing
• Columnar queries
• RDDs
• Machine Learning
• Analytics & Streaming

• Fast data store and processing
• In-memory & Persistent
• Highly Consistent
• Transaction processing
• Thousands of concurrent clients

Source Code

http://pivotal-open-source-hub.github.io/StockInference-Spark/