Post on 26-Jan-2017
transcript
© 2015 Pivotal Software, Inc. All rights reserved.
Building a Stock Prediction system with Machine Learning using Geode, Spring XD and Spark MLlib
William Markito (@william_markito)
Fred Melo (@fredmelo_br)
It's all about DATA
Data Sources
Look for patterns
Prediction
Inputs: price(x), medium avg (x), relative strength (x)
Machine Learning Model (e.g. Linear Regression)
Output: medium avg (x+1)
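The slide above can be read as a supervised learning setup: features observed at time x are used to predict the average at time x+1. The following is a minimal sketch of that idea in plain Python, not the presenters' implementation (the talk uses Spark MLlib). The function names, the window size, the simple-average formula and the gain/loss ratio used for relative strength are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def moving_average(prices: Sequence[float], window: int) -> List[float]:
    """Simple average of the last `window` prices, starting at index window-1."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]


def relative_strength(prices: Sequence[float], window: int) -> List[float]:
    """Average gain divided by average loss over the last `window` changes."""
    out = []
    for i in range(window, len(prices)):
        changes = [prices[j] - prices[j - 1] for j in range(i - window + 1, i + 1)]
        gains = sum(c for c in changes if c > 0) / window
        losses = sum(-c for c in changes if c < 0) / window
        out.append(gains / losses if losses > 0 else 100.0)  # assumed cap
    return out


def build_rows(prices: Sequence[float], window: int) -> Tuple[List[List[float]], List[float]]:
    """Pair features at time x with the average at time x+1."""
    ma = moving_average(prices, window)
    rs = relative_strength(prices, window)
    rows, targets = [], []
    for x in range(window, len(prices) - 1):
        # Leading 1.0 is the intercept term.
        rows.append([1.0, prices[x], ma[x - window + 1], rs[x - window]])
        targets.append(ma[x - window + 2])
    return rows, targets


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(rows: List[List[float]], targets: List[float]) -> List[float]:
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights: Sequence[float], row: Sequence[float]) -> float:
    """Dot product of the learned weights and one feature row."""
    return sum(w * v for w, v in zip(weights, row))
```

Calling `build_rows(prices, window)` on a price history and passing the result to `fit_linear` yields weights that `predict` can apply to the latest feature row. A real pipeline would hand the same rows to a library regression model instead of the hand-written solver.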
© Copyright 2014 Pivotal. All rights reserved.
SpringXD
Extensible, Open-Source, Fault-Tolerant, Horizontally Scalable, Cloud-Native
Transform, Enrich, Filter, Split, Sink
[Architecture diagram: Real data and Simulator inputs; Machine Learning; Dashboard; numbered steps 1 to 3 with the labels Indicators and Predict; regions /Stocks, /TechIndicators, /Predictions]
Apache Geode (incubating)
Introduction
Introduction
A distributed, memory-based data management platform for data-oriented apps that need:
• High performance, scalability, resiliency and continuous availability
• Fast access to critical data sets
• Location-aware distributed data processing
• Event-driven data architecture
Concepts
Cache
• In-memory storage and management for your data
• Configurable through XML, Spring, Java API or CLI
• Collection of Regions
Concepts
Region
• Distributed java.util.Map on steroids (Key/Value)
• Consistent API regardless of where or how data is stored
• Observable (reactive)
• Highly available, redundant on cache Member(s)
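To make the "observable map" idea concrete, here is a small single-process sketch in Python. It is not the Geode API: the class `ObservableRegion`, its method names and the event tuple layout are hypothetical, and distribution and redundancy are left out entirely. It only shows a key/value store that notifies registered listeners on every change.

```python
from typing import Any, Callable, Dict, List

# Listener signature (assumed): operation, key, old value, new value.
Listener = Callable[[str, Any, Any, Any], None]


class ObservableRegion:
    """Hypothetical in-memory stand-in for a named key/value region."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._data: Dict[Any, Any] = {}
        self._listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def put(self, key: Any, value: Any) -> None:
        # Distinguish a first write from an overwrite so listeners can react.
        operation = "update" if key in self._data else "create"
        old = self._data.get(key)
        self._data[key] = value
        self._notify(operation, key, old, value)

    def get(self, key: Any) -> Any:
        return self._data.get(key)

    def destroy(self, key: Any) -> None:
        old = self._data.pop(key)
        self._notify("destroy", key, old, None)

    def _notify(self, operation: str, key: Any, old: Any, new: Any) -> None:
        for listener in self._listeners:
            listener(operation, key, old, new)
```

A caller would register a callback with `add_listener` and then use `put`, `get` and `destroy` as with a dictionary. Each write reaches the callback as a create, update or destroy event.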
Concepts
Region
• Local, Replicated or Partitioned
• In-memory or persistent
• Redundant
• LRU
• Overflow
LOCAL, LOCAL_HEAP_LRU, LOCAL_OVERFLOW, LOCAL_PERSISTENT, LOCAL_PERSISTENT_OVERFLOW, PARTITION, PARTITION_HEAP_LRU, PARTITION_OVERFLOW, PARTITION_PERSISTENT, PARTITION_PERSISTENT_OVERFLOW, PARTITION_PROXY, PARTITION_PROXY_REDUNDANT, PARTITION_REDUNDANT, PARTITION_REDUNDANT_HEAP_LRU, PARTITION_REDUNDANT_OVERFLOW, PARTITION_REDUNDANT_PERSISTENT, PARTITION_REDUNDANT_PERSISTENT_OVERFLOW, REPLICATE, REPLICATE_HEAP_LRU, REPLICATE_OVERFLOW, REPLICATE_PERSISTENT, REPLICATE_PERSISTENT_OVERFLOW, REPLICATE_PROXY
Concepts
Member
• A process that has a connection to the system
• A process that has created a cache
• Embeddable within your application
Client
Locator
Server
Concepts
Client cache
• A process connected to the Geode server(s)
• Can have a local copy of the data
• Can be notified about events on the servers
Concepts
Listeners
• CacheWriter / CacheListener
• AsyncEventListener (queue / batch)
• Parallel or Serial
• Conflation
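The queue, batch and conflation items above can be illustrated with a short Python sketch. This is a simplified model, not Geode's implementation: the class `ConflatingQueue` and its methods are hypothetical, and the exact ordering rules of a real conflating queue may differ. The point is that when several updates for the same key are waiting, only the latest value is delivered in the next batch.

```python
from typing import Any, Dict, List, Tuple


class ConflatingQueue:
    """Hypothetical pending-event queue that keeps one entry per key."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        # dict preserves insertion order, so a key keeps its queue position
        # when its value is overwritten by a newer update.
        self._pending: Dict[Any, Any] = {}

    def offer(self, key: Any, value: Any) -> None:
        # Conflation: a newer value replaces the pending one for the same key.
        self._pending[key] = value

    def next_batch(self) -> List[Tuple[Any, Any]]:
        # Hand at most batch_size events to the (asynchronous) listener.
        keys = list(self._pending)[:self.batch_size]
        return [(k, self._pending.pop(k)) for k in keys]

    def __len__(self) -> int:
        return len(self._pending)
```

With three offers for two symbols, for example two price updates for one symbol and one for another, `next_batch()` returns two events, and the symbol updated twice carries its most recent price.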
Apache Geode (incubating)
• Currently under incubation in the Apache Software Foundation
• Welcome contributions and contributors
  • Code and Patches
  • Bugs, feature requests
  • Documentation and content
  • Any form of feedback
Apache Geode (incubating)
Code
• New features
• Bug fixes (patches)
• Writing tests
Documentation
• Wiki
• Web site
• User guides
Community
• Join our mailing lists (Ask or answer)
• Become a speaker
• Find and report bugs
• Testing a release candidate or beta
Apache Geode (incubating)
• JIRA - https://issues.apache.org/jira/browse/GEODE
• GitHub - https://github.com/apache/incubator-geode
• Mailing lists:
  • Development - dev@geode.incubator.apache.org
  • Users - user@geode.incubator.apache.org
• Wiki - cwiki.apache.org/confluence/display/GEODE
• StackOverflow - http://stackoverflow.com/questions/tagged/geode+or+gemfire
SpringXD
Introduction
Concepts
Concepts
A stream is composed from modules. Each module is deployed to a container and its channels are bound to the transport.
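The module composition described above can be sketched in a few lines of Python. This is only an analogy, not Spring XD itself: the helper names are hypothetical, everything runs in one process, and plain Python iteration stands in for the channels and the transport that connect modules across containers.

```python
from typing import Any, Callable, Iterable, Iterator, List

# A processor module takes an incoming message stream and returns a new one.
Processor = Callable[[Iterable[Any]], Iterator[Any]]


def filter_module(predicate: Callable[[Any], bool]) -> Processor:
    """Processor that drops messages for which predicate is false."""
    return lambda stream: (msg for msg in stream if predicate(msg))


def transform_module(fn: Callable[[Any], Any]) -> Processor:
    """Processor that maps each message to a new payload."""
    return lambda stream: (fn(msg) for msg in stream)


def run_stream(source: Iterable[Any],
               processors: List[Processor],
               sink: Callable[[Any], None]) -> None:
    """Chain source -> processors -> sink, one message at a time."""
    stream: Iterable[Any] = source
    for processor in processors:
        stream = processor(stream)
    for msg in stream:
        sink(msg)
```

A usage example under the same assumptions: pass a list of tick dictionaries as the source, a `filter_module` that drops ticks without a positive price, a `transform_module` that extracts the fields of interest, and a sink such as `print` or a list's `append`.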
Apache Zeppelin (incubating)
Introduction
Concepts
• Web based REPL
• Iterative & Exploratory
• Support for Data Ingestion
Concepts
Multi interpreters
• Markdown
• Shell
• Spark
• Geode
• Python
• …
Concepts
Sharing through URLs without Reports
Apache Spark
Introduction
Concepts
RDD, Dataframe, Driver, Worker
"An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes."
Concepts
RDD, Dataframe, Driver, Worker
"A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (pandas), previously known as SchemaRDD."
Concepts
RDD, Dataframe, Driver, Worker
Summary
Summary
• Integration
  • Spark, JDBC, Geode
  • HDFS, Twitter, File, Mail…
• Data pipeline orchestration
• Intuitive DSL
• Streaming & Analytics
• Distributed and scalable

• Web based REPL
• Multiple Interpreters
  • Apache Spark
  • Markdown
  • Flink
  • Python
  • Geode…
• Iterative & Exploratory
Summary
• Fast data processing
• Columnar queries
• RDDs
• Machine Learning
• Analytics & Streaming

• Fast data store and processing
• In-memory & Persistent
• Highly Consistent
• Transaction processing
• Thousands of concurrent clients
Source Code
http://pivotal-open-source-hub.github.io/StockInference-Spark/