Date posted: 28-Nov-2014
Category: Technology
Uploaded by: natalino-busa
Big Data Solutions for Marketing Analytics
Natalino Busa (@natalinobusa)
Parallelism Hadoop Cassandra Akka
Machine Learning Statistics Big Data
Algorithms Cloud Computing Scala Spray
www.natalinobusa.com
Humanize Data
The bank statements: how I read the bank bills, and what happened those days
Back to routine. Grocery, broken washing machine.
After-vacation fun. Pancake house.
Traveling back.
Just back home. Pizza.
Shopping in Sicily. Vacation!
Data is the fabric of our lives.
Let’s give more meaning and context to data.
Abraham Harold Maslow (April 1, 1908 – June 8, 1970) was an American psychologist who was best known for creating Maslow's hierarchy of needs
Physiology: breathing, food, water, sleep
Contractual: security of body, resources, health, employment, property
Love & Caring: friends, family, partner; security of love and belonging
Esteem: self-esteem, confidence, achievements, respect
Self-actualization: spontaneity, creativity, acceptance, freedom, ethics
Very human needs
How caring can technology be?
Physiology: connectivity, electricity, hardware / infra
Contractual: security of basic operations; REST APIs, encryption, authentication
Love & Caring: notifications, alerts, social bonding, predictions
Esteem: set goals, planning, achievements, advisory role
Self-actualization: freedom, trusted companion
Technology is reaching out
Data science top 3
Dimensionality Reduction
Predictive Analytics
Clustering, Segmentation
Data science: what’s working?
- Random Forests
- Artificial Neural Networks
- Clustering Algorithms
- Pattern Recognition
- Time-series analysis
- Regression
Most actual models are a combination of these.
Data science ^.^/
keep it scientific
cross-validate your models
keep it measurable
play with it
create new features
explore the available data
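One way to make "cross-validate your models" and "keep it measurable" concrete is k-fold cross-validation: hold out each fold in turn, fit on the rest, and measure the error on the held-out part. The sketch below is illustrative only and not taken from the talk; the toy data, the `fit_line` helper, and the choice of k = 5 are assumptions.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (hypothetical helper)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def cross_validate(xs, ys, k=5):
    """Return the mean squared error measured on each held-out fold."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean((ys[i] - (a + b * xs[i])) ** 2 for i in fold))
    return scores


# Toy data (assumed): y = 2x + 1 plus a little Gaussian noise.
rng = random.Random(42)
xs = [float(x) for x in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
scores = cross_validate(xs, ys, k=5)
print("per-fold MSE:", [round(s, 3) for s in scores])
```

Comparing the per-fold errors, rather than a single training error, gives a measurable estimate of how the model behaves on data it has not seen.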
How to code data science?
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
● Language for statistics
● Easy to analyze and shape data
● Advanced statistical packages
● Fueled by academia and professionals
● Very clean visualization packages
Packages for machine learning: time-series forecasting, clustering, classification, decision trees, neural networks
Remote procedure calls (RPC): from Scala/Java via RProcess and Rserve
Data Science: R
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
● Flexible, concise language
● Quick to code and prototype
● Portable, visualization libraries
Machine learning libraries: scipy, statsmodels, sklearn, matplotlib, ipython
Web libraries: flask, tornado, (no)SQL clients
Data Science: Python
Earn the trust
The customer’s context
Personal history: amount of transactions ever done
Long-term interaction: how the user's actions correlate with those of others
Real-time events: trends and recent events
The customer’s context
context is related to time:
slow changing: the defining characteristic of a person
fast changing: events which influence our lives, trends
These require very different technology solutions!
Challenges
Not much time to react
Events must be delivered fast to the new machine APIs
It’s Web and Mobile Apps: latency budget is limited
Loads of information to process
Understand the user history well
Access a larger context
Big Data and Fast data
ranking and preference
segmentation and clustering
short term trending topics
rule-based recommendations
Tens of terabytes of data. This can take hours...
Hundreds of events per second. This must be fast...
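For the fast-data side, "short term trending topics" can be maintained incrementally, one event at a time, instead of rescanning the full history. The following is a minimal sketch under assumptions: the `TrendingWindow` class, the 60-second window, and the sample events are all hypothetical and not part of the talk.

```python
from collections import Counter, deque


class TrendingWindow:
    """Counts topics seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()      # (timestamp, topic), oldest first
        self.counts = Counter()

    def add(self, timestamp, topic):
        """Record one event; timestamps are assumed to arrive in order."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        """Drop events that have fallen out of the window."""
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n):
        """Return the n most frequent topics currently in the window."""
        return self.counts.most_common(n)


window = TrendingWindow(window_seconds=60)
for ts, topic in [(0, "pizza"), (10, "travel"), (20, "pizza"),
                  (70, "grocery"), (75, "grocery"), (80, "travel")]:
    window.add(ts, topic)
print(window.top(2))  # [('grocery', 2), ('travel', 1)]
```

Each event costs a constant amount of work on average, which is what keeps this kind of computation inside a tight latency budget; the ranking and segmentation jobs over terabytes stay on the batch side.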
Back to the drawing board
core banking systems
SOAP services and DBs
System BUS
customer-facing apps
channels
A high-level bank schematic
Higher separation!
Fewer silos
Interactions with core systems
Bigger and Faster
Human-centric applications
Some techs
Hadoop: Distributed Data OS
Reliable: distributed, replicated file system
Low cost: ↓ cost vs ↑ performance/storage
Computing powerhouse: all of the cluster's CPUs work in parallel to run queries
Cassandra: A low-latency 2D store
Reliable: distributed, replicated storage
Low latency: sub-millisecond read/write operations
Tunable CAP: define your level of consistency
Data model: hashed rows, sorted wide columns
Architecture model: no single point of failure (SPOF), ring of nodes, homogeneous system
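The "hashed rows, sorted wide columns" and "ring of nodes" ideas can be illustrated with a toy in-memory model. This is a sketch of the concept only, not how Cassandra is implemented and not its client API: the `ToyRing` class, the MD5 token function, the node names, and the sample rows are all assumptions made for illustration.

```python
import bisect
import hashlib


def token(key):
    """Hash a key to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """A toy ring of nodes; each row lives on `replicas` consecutive nodes."""

    def __init__(self, node_names, replicas=2):
        self.ring = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in self.ring]
        self.replicas = replicas
        # node name -> row key -> list of (column, value) sorted by column
        self.storage = {name: {} for name in node_names}

    def nodes_for(self, row_key):
        """Walk clockwise from the row's token and pick the replica nodes."""
        start = bisect.bisect_right(self.tokens, token(row_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replicas)]

    def write(self, row_key, column, value):
        for node in self.nodes_for(row_key):
            row = self.storage[node].setdefault(row_key, [])
            names = [c for c, _ in row]
            pos = bisect.bisect_left(names, column)
            if pos < len(row) and row[pos][0] == column:
                row[pos] = (column, value)        # overwrite existing column
            else:
                row.insert(pos, (column, value))  # keep columns sorted by name

    def read_slice(self, row_key, start, end):
        """Return the columns with start <= name < end from the first replica."""
        row = self.storage[self.nodes_for(row_key)[0]].get(row_key, [])
        names = [c for c, _ in row]
        return row[bisect.bisect_left(names, start):bisect.bisect_left(names, end)]


ring = ToyRing(["node-a", "node-b", "node-c"], replicas=2)
ring.write("customer-42", "2014-11-03", "pizza")
ring.write("customer-42", "2014-10-28", "shopping in Sicily")
ring.write("customer-42", "2014-11-05", "grocery")
print(ring.read_slice("customer-42", "2014-11-01", "2014-12-01"))
# [('2014-11-03', 'pizza'), ('2014-11-05', 'grocery')]
```

Hashing the row key spreads customers evenly over identical nodes, while keeping the columns sorted makes a range read (for example, one month of events for one customer) a contiguous slice.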
Scala / Akka / Spray: a WEB API reactive framework
[Diagram: Actor A, Actor B, and Actor C exchanging messages msg 1 to msg 4]
● it scales horizontally (can run in cluster mode)
● maximum use of the available cores/memory
● processing is non-blocking, threads are re-used
● can parallelize computing power across many actors
Very fast: thousands of messages per second
Very reliable: auto recovery
Lazy: compute only when required
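The actor idea (private state, a mailbox, messages handled one at a time, no blocking while waiting) can be sketched in plain Python with `asyncio`. This is not Akka and it does not show multi-core scaling, since `asyncio` runs on a single thread; the `CounterActor` class, the `None` stop signal, and the sample payments are assumptions for illustration.

```python
import asyncio


class CounterActor:
    """A toy actor: private state, a mailbox, one message handled at a time."""

    def __init__(self):
        self.mailbox = asyncio.Queue()
        self.totals = {}

    async def run(self):
        while True:
            message = await self.mailbox.get()   # waits without blocking a thread
            if message is None:                  # None is our (assumed) stop signal
                break
            customer, amount = message
            self.totals[customer] = self.totals.get(customer, 0) + amount

    def tell(self, message):
        """Fire-and-forget send, loosely modelled on an actor 'tell'."""
        self.mailbox.put_nowait(message)


async def main():
    actors = [CounterActor() for _ in range(3)]
    tasks = [asyncio.create_task(actor.run()) for actor in actors]
    payments = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3)]
    for customer, amount in payments:
        # Route by customer so each customer's state lives in exactly one actor.
        actors[sum(customer.encode()) % len(actors)].tell((customer, amount))
    for actor in actors:
        actor.tell(None)
    await asyncio.gather(*tasks)
    merged = {}
    for actor in actors:
        merged.update(actor.totals)
    return merged


totals = asyncio.run(main())
print(totals)
```

Because each customer's total is only ever touched by one actor, no locks are needed; adding actors spreads the work without changing the message-handling code.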
Putting it all together
[Diagram: a lambda (λ) architecture linking data queues, Hadoop, and an actor-based application; remaining labels: "millions of", "conversions"]
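Assuming the λ on this slide refers to the lambda architecture pattern, the core idea is that a slow batch layer recomputes views over the full history, a fast layer tracks only what has arrived since, and queries combine the two. The sketch below is a simplified illustration; the function and class names and the sample amounts are hypothetical.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Slow path: recompute totals from the full history (e.g. a periodic job)."""
    view = Counter()
    for customer, amount in historical_events:
        view[customer] += amount
    return view


class SpeedLayer:
    """Fast path: incrementally tracks events the batch view has not seen yet."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, customer, amount):
        self.view[customer] += amount

    def reset(self):
        """Called once a new batch view covers these events."""
        self.view.clear()


def query(customer, batch_view, speed_layer):
    """Serving step: combine the precomputed view with the recent increments."""
    return batch_view[customer] + speed_layer.view[customer]


batch_view = build_batch_view([("alice", 100), ("bob", 40), ("alice", 25)])
speed = SpeedLayer()
speed.on_event("alice", 5)
speed.on_event("carol", 12)
print(query("alice", batch_view, speed), query("carol", batch_view, speed))  # 130 12
```

In the stack described here, the batch view would plausibly come from Hadoop and the incremental part from the actor-based application, though the slide itself does not spell out that mapping.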
Science & Engineering
Statistics, Data Science
Python, R, Visualization
IT Infra, Big Data
Java, Scala, SQL
Hadoop: Big Data Infrastructure, Data Science on large datasets
Big Data and Fast Data require different profiles to achieve the best results
Some lessons learned
● Mixing and matching technologies is a good thing
● Fast Data must complement Big Data
● Ease integration among teams
● Hadoop, Cassandra, and Akka
● Data Science takes time to figure out
Parallelism Mathematics Programming
Languages Machine Learning Statistics
Big Data Algorithms Cloud Computing
Thanks! Any questions?