Date posted: 15-Jul-2015
Shiva Amiri, PhD
Chief Product Officer
MLConf Seattle - May 1st 2015
Incorporating the Real Time Component into Analytics and Machine Learning
The Challenge
One or more structural limitations have significantly constrained successful data mining applications and initiatives
Frequently, these problems are associated with the amount of data, the rate of data generation and the number of attributes (variables) to be processed –
1000’s of data variables from which to model (dimensionality)
100’s of billions of records of data to model
Continuously evolving data elements and changing sets of data
The need to execute and adapt in Real Time
Increasingly, this “big data” environment expands beyond the capabilities of conventional data mining methods and technology
Source: http://www.informationweek.com/big-data/big-data-analytics/5-analytics-bi-data-management-trends-for-2015/a/d-id/1318551 -09/01/2015
What are the trends?
The Market Opportunity
IDC reports the Big Data Analytics market at $125 billion in 2015
Gartner reports the Internet of Things (IoT) will have 25 billion devices with sensors connected by 2020, producing exabytes of data
IoT/E market size by 2020 will exceed $14 trillion
Bioinformatics market is $7.5 billion according to Gartner
Streaming data, Real Time analytics and machine learning remain a significant challenge for multiple sectors
Which verticals are we looking at?
Bioinformatics, Computational Biology – genetics, proteomics, EEG data, fMRI, Molecular Dynamics data, etc.
Financials – behaviour, signals, patterns
Internet of Everything
Other sources of fast, massive data are also of interest to us
What kinds of questions do we want to ask?
How do the genes and proteins in disorders relate to each other? – clustering, regression, classification, etc.
What are the other factors involved in disease onset and progression?
What about environment data? Quality of Life? Education? Socioeconomic status? – natural language processing (NLP), classification, predictive modeling, etc.
How can we handle massive amounts of brain sensing and imaging data (EEG, fMRI) and link them to other data (genes and proteins)? – Integrative analytics
And questions we don’t know we have
RTDS’ SymetryML™: What have we built?
SymetryML™ is a distributed GPU-implemented predictive analysis and modeling technology for our Massive Data universe…
V3.5 released – real time analytics of large-scale data
Exploration (statistics) and model building, assessment and prediction in real time
Robust security and privacy features
V4.0 being developed – distributed computing capability
How is SymetryML™ addressing these challenges?
The V’s of Big Data
SymetryML™ can handle heavy volumes of data (Volume)
SymetryML™ can handle streaming data (Velocity)
Accelerated hardware with GPUs and distributed computing
REST API – flexibility and modular design, seamless integration into existing systems or development of custom systems
Simplicity of the design
Real Time analytics – exploration and model generation/prediction, handling massive data with unprecedented speed in real time
Privacy and security
Service Oriented Architecture – XaaS
Faster: In minutes, SymetryML™ can utilize 10,000’s+ variables by constructing 1000’s of model combinations and ultimately reducing the variables to a single model; it builds models in real time as it learns
Smarter with Scale: Linearly scalable, with no limit on the length of data sets or the depth of categorical data, allowing unlimited learning from data
More Agile on-the-fly: Continuous learning, both distributed and parallel
Simply Deployed: SymetryML™ models can be deployed in real time or in the form of scripts (SQL, Java, etc.)
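To make "deployed in the form of scripts" concrete, here is a minimal sketch that renders a linear model as a SQL scoring expression. It is a generic illustration only and does not show SymetryML™'s actual export format; the function name `linear_model_to_sql`, the table and column names, and the coefficients are all hypothetical.

```python
def linear_model_to_sql(intercept, coefficients, table):
    """Render a linear model as a SQL SELECT that scores every row of `table`.

    `coefficients` maps column name -> weight. All names are hypothetical;
    this is not SymetryML's export format.
    """
    terms = [repr(float(intercept))]
    for column, weight in coefficients.items():
        terms.append(f"({float(weight)!r} * {column})")
    return f"SELECT {' + '.join(terms)} AS prediction FROM {table}"


# Hypothetical model: prediction = 0.5 + 2.0 * clicks - 0.25 * age
sql = linear_model_to_sql(0.5, {"clicks": 2.0, "age": -0.25}, "events")
print(sql)
```

Because the model is reduced to a plain expression, it can be scored inside the database without shipping data to a separate service.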
[Diagram: Proprietary Statistical Representation, with Data, Learner, Modeler, Predictor and Explorer components]
SymetryML™: A few key features
Parallel Processing / Distributed Computing
Incremental / Decremental Learning (no rescan)
Automated Variable Selection
Add variables on-the-fly
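Parallel and distributed learning without a rescan is possible when each worker keeps a compact summary that can be merged with others. The sketch below shows that general idea for a single variable, using Welford's update and the pairwise merge of Chan et al. It is an assumption-laden illustration, not SymetryML™'s proprietary representation; the `Summary` class and `summarize` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Count, mean and sum of squared deviations (m2) for one variable."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def learn(self, x):
        # Welford's update: one pass, earlier rows are never revisited.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Pairwise combination of two partial summaries (Chan et al.).
        n = self.n + other.n
        if n == 0:
            return Summary()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    def variance(self):
        # Sample variance; requires at least two rows.
        return self.m2 / (self.n - 1)


def summarize(values):
    s = Summary()
    for v in values:
        s.learn(v)
    return s


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# Two "workers" each see half of the data; their summaries are then merged.
merged = summarize(data[:4]).merge(summarize(data[4:]))
whole = summarize(data)
print(merged.n, merged.mean, merged.variance())
```

The merged summary matches the one computed over the whole data set, which is what lets partitions be processed independently.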
Component Technologies
Component              Project    Language
Web UI                 sym-web    JavaScript
REST API               sym-rest   Java
Core functionalities   sym-core   Java
NVIDIA GPU support     sym-core   C/C++
SymetryML™-CORE Basic Functionality:
Learn / Forget data
Univariate Analysis – Mean, StDev, F Test, Z Test, T Test
Bivariate Analysis
Correlation
Hypothesis Testing
Chi-square Testing
ANOVA
Model Selection and Creation
Predictions
Assessment
Persistence
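As a rough illustration of how "Learn / Forget" can coexist with univariate and bivariate statistics, the sketch below keeps running sums for two variables and derives the mean, standard deviation and correlation from them. Forgetting a row simply subtracts its contribution. This is a generic technique shown under stated assumptions, not the sym-core implementation; `PairStats` and its methods are hypothetical names.

```python
import math


class PairStats:
    """Running sums for two variables that support both learn and forget."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0      # sums of x and y
        self.sxx = self.syy = 0.0    # sums of squares
        self.sxy = 0.0               # sum of cross-products

    def _update(self, x, y, sign):
        self.n += sign
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.syy += sign * y * y
        self.sxy += sign * x * y

    def learn(self, x, y):
        self._update(x, y, +1)

    def forget(self, x, y):
        # The caller must only forget rows that were learned earlier.
        self._update(x, y, -1)

    def mean_x(self):
        return self.sx / self.n

    def stdev_x(self):
        # Sample standard deviation derived from the sums.
        var = (self.sxx - self.sx * self.sx / self.n) / (self.n - 1)
        return math.sqrt(max(var, 0.0))

    def correlation(self):
        cov = self.sxy - self.sx * self.sy / self.n
        vx = self.sxx - self.sx * self.sx / self.n
        vy = self.syy - self.sy * self.sy / self.n
        return cov / math.sqrt(vx * vy)


stats = PairStats()
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (10.0, -5.0)]:
    stats.learn(x, y)
stats.forget(10.0, -5.0)  # remove the outlier without rescanning the rest
print(stats.mean_x(), stats.stdev_x(), stats.correlation())
```

Raw sums of squares can lose precision on large or poorly scaled data, so a production system would likely use a more numerically stable formulation.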
RTDS Inc. – Headlines
Team of 6 engineers and Data Scientists in Toronto, Board in NY
Focus on Technology Differentiation
Technology timeline
March ’13 – Launched .NET Based Desktop Version
July ’13 – Launched SymetryML™ Server with REST API
December ’13 – Successfully deployed first GPU-based system
June ’14 – Algorithmic Support Expanded
’15 Roadmap: Aggressive, Attainable and Defensible
Proven technology with successful deployment in advertising
Current Financing: Mogility Capital
Next steps
We’ve been successful with this technology in the mobile advertising space; now we want to apply it in other strategic sectors
We are looking for partners as beta users, with unique datasets and use cases: what kinds of questions can we help answer with your data?
We are looking for integration partners where both parties can enhance their offerings
Develop the next version (v4.0) of SymetryML™ – fully parallel with Apache Spark
Thank you
www.rtdsinc.com
Contact
SymetryML™ and GPUs
• Native libraries that use NVIDIA GPUs are available for:
• Linux 64 bit (CentOS 5.x and Amazon Linux)
• Use of GPUs for core operations:
• Learning / Forgetting data
• Model Building
• Model Selection
SymetryML™-WEB
• Interactive HTML5 application
• Direct connection to SYM-REST
• It is a de facto lightweight front end to SYM-REST
• Based on Sencha Ext JS 4.x
SYM-REST
• Provides a RESTful API to sym-core
• Supported Data Sources:
• Amazon S3
• SFTP
• HTTP/HTTPS
• Redshift
• Upcoming Data Sources:
• HDFS
• ODBC/JDBC
SYM-REST Security
• Each user of the REST API needs an access key
• We generate these keys
• Each key is a 128-bit AES key
• Every REST request is authenticated with an HMAC (SHA-1) code based on part of the request
• If data encryption is needed, HTTPS can be used
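The HMAC-based request authentication described for SYM-REST can be sketched with Python's standard library. The slides do not say which parts of the request are signed or how the code is transmitted, so the choice of method, path and timestamp below is an assumption, as are the function names and the example path.

```python
import hashlib
import hmac
import secrets


def sign_request(key: bytes, method: str, path: str, timestamp: str) -> str:
    """Return a hex HMAC-SHA1 code over selected parts of the request.

    Which parts are signed is an assumption made for this sketch.
    """
    message = "\n".join([method.upper(), path, timestamp]).encode("utf-8")
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def verify_request(key, method, path, timestamp, signature) -> bool:
    expected = sign_request(key, method, path, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


key = secrets.token_bytes(16)  # 16 bytes = 128 bits, generated server-side
ts = "2015-05-01T12:00:00Z"
sig = sign_request(key, "GET", "/hypothetical/projects", ts)

ok = verify_request(key, "GET", "/hypothetical/projects", ts, sig)
tampered = verify_request(key, "GET", "/hypothetical/other", ts, sig)
print(ok, tampered)
```

The HMAC proves that the sender holds the key and that the signed parts were not altered, but it does not hide the payload, which is why HTTPS is still needed when the data itself must be encrypted.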
Finance data example
• NASDAQ TotalView-ITCH Intraday Data Modeling
175 GB – one month of raw data
55 GB of transactions for NASDAQ-100 constituents
12M rows / 400 attributes
Univariate analysis across securities
Covariance and Hypothesis Testing
Model Building: Classification/Regression
Prediction of Price Movement
Full Order Book Analysis