Big Data and Frameworks:
Perspectives on Applied Machine Learning
Mehdi Habibzadeh
PhD in Computer Science
Outline (Oct 2016):
Big Data and Challenges
Review and Trends
Math and Probability Concepts
Data Structure and Retrieval Algorithms
Map-Reduce on Large Clusters
Hadoop Framework Programming
Apache Spark Framework
Big Data and Cloud Computing
Big Data and NoSQL
Machine Learning (Conventional and Deep Learning)
Big Data in the real world
2016 Big Data and Applied Machine Learning 2
Big Data and Challenges
Sources and Massive Information
Characteristics and Trends
The year 2015 saw a big jump in the world of big data.
» Adoption of technologies associated with unstructured data
» Ref : http://www.tableau.com/top-8-trends-big-data-2016?
Big Data: Math Terms
Understanding and Visualization
Missing values, outlier values, ...
ML: Maximum Likelihood
EM: Expectation Maximization
The interquartile range (IQR)
Data Mining and Statistical Approaches
Data Dimensionality Reduction (PCA, SFS, BFS, ...)
Relevance and Redundancy (Kruskal–Wallis, Kolmogorov–Smirnov)
Regression modeling (e.g., Logistic Regression)
Data compression (Singular Value Decomposition)
Variable Selection and Ranking (Eigenvalues/Eigenvectors, HDMR)
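As a small illustration of the outlier and IQR items above, the following sketch flags outliers with Tukey's rule (values more than 1.5 × IQR outside the quartiles). The function name `iqr_outliers`, the multiplier `k` and the toy readings are assumptions made for this example, not part of the original slides.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (low_fence, high_fence, outliers) using Tukey's IQR rule."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return low, high, outliers

# Hypothetical sensor readings with one obviously suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 95, 12, 13]
low, high, outliers = iqr_outliers(readings)
print(outliers)
```

The multiplier 1.5 is only a common convention; a larger `k` flags fewer points.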
Big Data: Math Terms (Cont.)
Feature selection:
Reasons and motivation
To trace the effect of high-dimensional invariant descriptors on
white blood cell classification performance.
To provide a smaller effective set compared to the starting data pool.
To avoid redundant or irrelevant features.
Two approaches (Wrapper and Filter):
Wrapper: an iterative method that scores each feature subset by its
predictive performance with a given classifier (Pattern Recognition algorithm).
Filter: the objective function evaluates subsets using statistical
dependency, regression, or interclass distance (Machine Learning).
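A minimal sketch of the filter approach, assuming the "statistical dependency" score is the absolute Pearson correlation between each feature and the class label. The feature names and values are hypothetical toy data; a wrapper method would instead retrain a classifier for every candidate subset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def filter_rank(features, labels):
    """Rank feature names by |correlation with the label|, best first."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical toy data: one informative, one noisy, one constant feature.
features = {
    "cell_area": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],
    "noise": [0.5, 0.1, 0.9, 0.4, 0.8, 0.2],
    "constant": [1.0] * 6,
}
labels = [0, 0, 0, 1, 1, 1]
ranking, scores = filter_rank(features, labels)
print(ranking)
```

Because the score never consults a classifier, the ranking is cheap to compute but can miss features that are only useful in combination.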
Big Data: Math Terms (Cont.)
Machine Learning and Predicting
Reliability, Uncertainty and Global Sensitivity Analysis
Clustering and Classification
Validation methods (cross-validation, hold-out datasets, ...)
Graph Laplacian for clustering
Deterministic methods (NN, SVM, ...)
Probabilistic methods (Bayes classifier, PAM, ...)
Deep Learning (Hierarchical Classification)
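To make the cross-validation item concrete, here is a minimal sketch of how k-fold splitting partitions sample indices. The helper `k_fold_indices` is hypothetical and does no shuffling; in practice the data would usually be shuffled (or stratified by class) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx); each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
for train, val in folds:
    print(len(train), val)
```

A hold-out split is the special case of using a single (train, validation) pair.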
Big Data Search Algorithms
Cache-aware and cache-oblivious models
Cache-oblivious: uses the CPU cache without knowing the size of the
cache (works on any sort of machine)
Memory performance and improvement
Adapts to arbitrary memory hierarchies
Data clustering
Locality of memory references is increased.
Applications: matrix multiplication, sorting, matrix transposition
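The matrix transposition application can be sketched as a recursive divide-and-conquer routine, which is the usual shape of a cache-oblivious algorithm: the matrix is halved along its longer side until the pieces are small, without ever referring to a cache size. The function names and the `base` cutoff are assumptions for this example, and in pure Python the locality benefit is illustrative rather than measurable.

```python
def _transpose_block(a, b, r0, r1, c0, c1, base):
    """Write the transpose of a[r0:r1][c0:c1] into b, splitting the longer side."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy directly; both source and target stay local.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        _transpose_block(a, b, r0, mid, c0, c1, base)
        _transpose_block(a, b, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        _transpose_block(a, b, r0, r1, c0, mid, base)
        _transpose_block(a, b, r0, r1, mid, c1, base)

def transpose(a, base=16):
    rows, cols = len(a), len(a[0])
    b = [[None] * rows for _ in range(cols)]
    _transpose_block(a, b, 0, rows, 0, cols, base)
    return b

matrix = [[r * 7 + c for c in range(7)] for r in range(5)]
result = transpose(matrix, base=2)
```

The `base` cutoff only limits recursion overhead; it is not tuned to any particular cache, which is the point of the cache-oblivious model.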
Big Data Retrieval Algorithms
Streaming
Online Data Management
Adapt to arbitrary and unstructured Input Data
Real-Time Analytical Processing (RTAP)
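One way to picture streaming and online data management is a statistic that is updated one record at a time, without storing the stream. The sketch below uses Welford's online algorithm for the mean and sample variance; the class name `RunningStats` and the sample values are assumptions for this example.

```python
class RunningStats:
    """Single-pass mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```

Memory use is constant regardless of how many records arrive, which is the property real-time analytical processing relies on.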
Map-Reduce on Large Clusters
Motivation and Demand:
MapReduce programs tend to be very short, code-wise
They represent a data flow
Map-Reduce (Cont.)
Each step has one Map phase and one Reduce phase
Any use case must be converted into the MapReduce pattern
Great solution for one-pass computations
Not very efficient for multi-pass computations and algorithms
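The Map and Reduce phases can be sketched on a single machine with the classic word-count example. This is only an in-memory illustration of the pattern, not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase` and `map_reduce` are chosen for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)

def map_reduce(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = map_reduce(["big data big clusters", "data flow"])
print(counts)
```

A multi-pass algorithm would have to chain several such jobs, with each job writing its output before the next one starts, which is where the inefficiency noted above comes from.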
Hadoop Framework
Features :
Open Source Framework for Processing Large Data
Works on Cheap and Unreliable Clusters
Well Known among Companies that Deal with Big Data Applications
Compatible with Java, Python and Scala
Hadoop Framework (Cont.)
MapReduce Framework
Assigns work to different nodes
Hadoop Distributed File System (HDFS)
Primary storage system used by Hadoop applications.
Copies each piece of data and distributes the copies to individual nodes
Name Node (metadata) and Data Nodes (file blocks)
Redundant information (replicated three times by default)
Machines in a given cluster are cheap and unreliable
Decreases the risk of catastrophic failure
» Even in the event that numerous nodes fail
Links together the file systems on different nodes to make an
integrated big file system (parallel processing)
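The effect of three-way replication can be illustrated with a toy model of name-node metadata. This is a simplified sketch, not HDFS code: the round-robin placement, the node names and the helpers `place_blocks` and `readable` are assumptions, and real HDFS uses a rack-aware placement policy.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Toy name-node metadata: block id -> data nodes holding a replica."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # Round-robin placement (an assumption; not the real HDFS policy).
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable(placement, failed):
    """A file is readable if every block has at least one live replica."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_blocks(4, nodes)
print(readable(placement, {"dn1", "dn2"}))         # two nodes down
print(readable(placement, {"dn1", "dn2", "dn3"}))  # all replicas of block 0 down
```

With three replicas per block, data is lost only when every node holding a given block fails at the same time.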
Hadoop Framework (Cont.)
Hadoop V.2 : Hadoop NextGen MapReduce (YARN)
Hadoop Framework (Cont.)
Hadoop Programming
Java
Full control of MapReduce; Cascading (open-source Java library)
Python , Scala, Ruby
Data Retrieval / Query Language
Hive
SQL-like language
Pig
Data flow language (simple, built out of small steps)
Scalding
Scala library built on top of Cascading (elegant model)
Big Data Programming
R, Java, Python and Scala (commonly used)
Three references (recommended reading):
https://www.linkit.nl/knowledge-base/177/4_most_used_languages_in_big_data_projects_Java
https://www.linkit.nl/knowledge-base/226/4_most_used_languages_in_big_data_projects_R
https://www.linkit.nl/eng/knowledge-base/196/4_most_used_languages_in_big_data_projects_Python
Apache Spark Framework
Spark Features (More than Distributed Processing)
Ease of use, and sophisticated analytics
In-memory data storage and near real-time processing
Holds intermediate results in memory
Stores as much data as possible in memory, then spills to disk
Spark vs Hadoop
Runs on top of existing HDFS
Handles data sets that are diverse in nature (text, videos, ...)
Handles variety in the source of data (batch vs. real-time streaming data).
Claimed to be up to 100 times faster than Hadoop MapReduce in memory
and up to 10 times faster when running on disk.
Apache Spark Framework (Cont.)
Compatible with Java, Scala and Python
Perform Data Analytics and Machine Learning
SQL Queries, Streaming Data
Machine Learning and Graph Data Processing
Spark MLlib, Spark’s Machine Learning library
Spark can process data stored in a Cassandra database
Big Data and Cloud
Cloud Computing Platform & Services
(Cloudera, Hortonworks, MapR, Azure)
Big Data and NoSQL
Key-Value Stores
Unique key and a pointer to a particular item of data.
Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB, Amazon SimpleDB, Riak
Column Family Stores
Very large amounts of data distributed over many machines.
Cassandra, HBase
Big Data and NoSQL (Cont.)
Document Databases
Similar to key-value stores, but:
Semi-structured documents are stored in formats like JSON,
allowing nested values associated with each key.
Document databases support more efficient querying than key-value stores.
CouchDB, MongoDB
Big Data and NoSQL (Cont.)
Graph Database
Flexible graph model
Instead of tables of rows and columns and the rigid structure
of SQL
Scale across multiple machines (Scale Out)
Neo4j, InfoGrid, InfiniteGraph, Titan
Big Data and NoSQL (Cont.)
JSON Format
RDBMS          NoSQL
Database       Database
Table, View    Collection
Row            Document (JSON, BSON)
Column         Field
Index          Index
Join           Embedded Document
Foreign Key    Reference
Partition      Shard
> db.user.findOne({age:39})
{
    "_id" : ObjectId("5114e0bd42…"),
    "first" : "John",
    "last" : "Doe",
    "age" : 39,
    "interests" : [
        "Reading",
        "Mountain Biking"
    ],
    "favorites" : {
        "color" : "Blue",
        "sport" : "Soccer"
    }
}
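The document model above can be explored without a database. The sketch below parses the same document as plain JSON and mimics an equality query with a hypothetical `find_one` helper; it is not the MongoDB driver API, and the `_id` field is left out because `ObjectId(...)` is shell syntax rather than valid JSON.

```python
import json

# The example document as plain JSON ("_id" omitted: ObjectId(...) is not JSON).
raw = """
{
    "first": "John",
    "last": "Doe",
    "age": 39,
    "interests": ["Reading", "Mountain Biking"],
    "favorites": {"color": "Blue", "sport": "Soccer"}
}
"""

users = [json.loads(raw)]  # a "collection" is modelled as a list of documents

def find_one(collection, query):
    """Return the first document whose top-level fields equal the query, else None."""
    for doc in collection:
        if all(doc.get(key) == value for key, value in query.items()):
            return doc
    return None

doc = find_one(users, {"age": 39})
print(doc["first"], doc["favorites"]["sport"], doc["interests"][1])
```

Nested values such as `favorites` and `interests` travel with the document, which is what replaces a join in the RDBMS column of the comparison.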
Machine Learning
Conventional Methods
Feature extraction and selection provide the input;
the proposed machine acts as the classifier
Sample ML Methods :
Support Vector Machines (SVM)
Naive Bayes Classifier
Artificial Neural Network (ANN)
Machine Learning (Cont.)
Support Vector Machine (SVM)
Kernel settings (linear, polynomial and Gaussian)
The number of features is considered relative to the number of training samples.
Less prone to overfitting than alternative choices.
Soft margin and hard margin.
Overfitting is controlled by the soft margin (slack variables εi)
One-versus-all for multi-class problems.
Works well in practice (the class with the highest response is chosen)
K-fold cross-validation (validation data)
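A minimal sketch of a soft-margin linear SVM, assuming the regularized hinge-loss objective λ/2·‖w‖² + (1/n)·Σ max(0, 1 − y_i(w·x_i + b)) minimized by batch subgradient descent. The slack of a sample is the hinge term max(0, 1 − y_i(w·x_i + b)). The function names, hyperparameters and toy data are assumptions for illustration; production SVM libraries use dedicated solvers and support kernels.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=5000):
    """Batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample has positive slack: it contributes a subgradient
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

A larger `lam` tolerates more slack (a softer margin); a smaller one approaches the hard-margin solution on separable data.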
Machine Learning : Deep Learning
Supervised & Unsupervised approaches
Greedy layer-wise unsupervised pre-training.
Builds a hierarchy of features one level at a time:
learns a new transformation at each level, to be composed with the
previously learned transformations.
Seeks regularities to extract a unique representation
Higher layers find representations more useful than the original input
Accurate hierarchical representation of complex data
Subsequent feature extraction
Classification problems (types and classes)
Deep Learning (Cont.)
Earliest concepts of deep learning :
Perceptron neural network structures.
A neural network can technically have more than one hidden layer.
Increasing the number of hidden layers leads to:
» Vanishing gradients, overfitting.
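The vanishing-gradient problem can be seen in a toy chain of one-unit sigmoid layers: by the chain rule, the gradient of the output with respect to the input is a product of sigmoid derivatives, each at most 0.25. The function `chain_gradient`, the unit weights and the input value are assumptions for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, x=0.5, weight=1.0):
    """d(output)/d(input) for a chain of `depth` one-unit sigmoid layers."""
    grad, activation = 1.0, x
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: multiply by sigma'(z) * weight, where sigma' = a * (1 - a).
        grad *= weight * activation * (1.0 - activation)
    return grad

for depth in (1, 2, 5, 10):
    print(depth, chain_gradient(depth))
```

The gradient shrinks roughly geometrically with depth, so the earliest layers of a deep sigmoid network receive almost no learning signal.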
Deep Learning (Cont.)
Auto-encoders, Stacked Auto-encoders, Restricted Boltzmann Machines (RBM),
the spike-and-slab RBM, Deep Belief Networks, Convolutional Networks
Deep Learning: Convolutional Neural Network
Extracts topologically invariant properties (spatially local connections,
i.e. receptive fields) from the gray-scale image
Especially suited to input that is spatially or temporally distributed
A CNN is composed of two distinct parts:
The first part consists of several convolution layers, each followed by down-sampling (max pooling)
The second part categorizes the pattern into classes (such as RBF).
A CNN consists of three different layer types:
convolution layers (with different feature maps), sub-sampling (max-pooling)
layers and an ensemble of fully connected layers
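The first part of the network, convolution followed by max pooling, can be sketched in a few lines. This is an illustrative single-channel example with a hand-picked edge kernel, not a trained network; the function names and the toy image are assumptions, and, as in most deep learning libraries, the "convolution" is implemented as cross-correlation (no kernel flip).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) to get a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool(fmap, size=2):
    """Down-sample by taking the maximum of each non-overlapping size x size window."""
    return [
        [
            max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# Hypothetical 5x5 gray-scale image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # local receptive field of size 2x2

feature_map = conv2d_valid(image, vertical_edge)
pooled = max_pool(feature_map)
print(feature_map)
print(pooled)
```

The pooled map keeps the fact that an edge exists in the left half while discarding its exact position, which is the source of the small-shift invariance mentioned above.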
Convolutional Neural Network (Cont.)
CNN: recognition rate after 105 epochs; few samples (28 per class);
similarity between Basophil and Lymphocyte
Deep Learning in Code!
Reference :
www.deeplearning.net/software_links
Programming Language :
Python – Matlab – Java – Lua
Machine Learning in Python
Scikit-learn, Keras, Caffe, ...
Pylearn2
Machine Learning in Lua (Matlab-like environment)
Torch7
Machine Learning in Java
Deeplearning4j
Big Data in the real world
Climate data, Large scale health care
Complex Image Processing
Personalization (Facebook, Telegram, ...)
Advertising, Mobile Telecommunication Networks (e.g., 5G),
E-commerce and E-banking Applications
Big Data in the real world (Cont.)
Deep Learning Algorithm Transcribes House Numbers (Google)
Big Data in the real world (Cont.)
Car Classification using Deep Learning Approach
Big Data in the real world (Cont.)
Banking Systems; Big Data and Deep Learning
Banknote Authentication and Forgery Detection
Financial Fraud Detection
Bank Embezzlement & Money Laundering
Boosting E-commerce Sales
Losses from Disgruntled Customers
Loan Approval Prediction
Contact Info
Mehdi (Nima) Habibzadeh Motlagh
PhD in Computer Science (Concordia University, Sept 2015)
Email : [email protected]
Cell phone : +98 912 326 7046
Telegram : +1 514 632 2838