Big Data and Frameworks: Perspectives to Applied Machine Learning

Mehdi Habibzadeh

PhD in Computer Science

Outline (Oct 2016):

Big Data and Challenges

Review and Trends

Math and Probability Concepts

Data Structure and Retrieval Algorithms

Map-Reduce on Large Clusters

Hadoop Framework Programming

Apache Spark Framework

Big Data and Cloud Computing

Big Data and NoSQL

Machine Learning (Conventional and Deep Learning)

Big Data in the real world

Big Data and Challenges

Sources and Massive Information

Characteristics and Trends

2015 was a year of major growth in the world of big data.

» Adoption of technologies associated with unstructured data

» Ref : http://www.tableau.com/top-8-trends-big-data-2016?


Big Data: Math Terms

Understanding and Visualization

Missing values, outlier values, ...

ML: Maximum Likelihood

EM: Expectation Maximization

The interquartile range (IQR)

Data Mining and Statistical Approaches

Data Dimensionality Reduction (PCA, SFS, BFS, ...)

Relevance and Redundancy (Kruskal–Wallis, Kolmogorov–Smirnov)

Regression modeling (Logistic Regression)

Data compression (Singular Value Decomposition)

Variable Selection and Ranking (Eigenvalues/Eigenvectors, HDMR)
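As a small illustration of the interquartile range listed above, the sketch below flags outliers with the common 1.5 × IQR rule. The multiplier `k`, the function name and the sample readings are assumptions made for this example, not part of the original slides.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious reading.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)
print(outliers)
```

The quartiles are robust to the extreme value itself, which is why this rule is often preferred over a mean and standard deviation threshold for a first look at the data.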


Big Data: Math Terms (Cont.)

Feature selection :

Reasons and motivation

To trace the effectiveness of the aforementioned high-dimensional invariant descriptors on white blood cell classification performance.

To provide a smaller effective set compared to the starting data pool.

To avoid redundant or irrelevant features.

Two approaches (Wrapper and Filter):

Wrapper: an iterative method that scores each feature subset by its predictive performance with a given classifier (pattern recognition algorithm).

Filter: the objective function evaluates subsets independently of any classifier, using statistical dependency, regression, or interclass distance (machine learning).
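A minimal sketch of the filter approach, assuming absolute Pearson correlation with the class label as the statistical dependency score. The feature names and values are hypothetical and chosen only for illustration; a wrapper method would instead retrain a classifier for each candidate subset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def filter_rank(columns, labels):
    """Rank feature names by |correlation| with the labels, best first."""
    scores = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical features for four samples with binary class labels.
features = {
    "area": [1, 2, 3, 4],      # strongly related to the label
    "noise": [5, 1, 4, 2],     # unrelated to the label
    "constant": [7, 7, 7, 7],  # carries no information
}
labels = [0, 0, 1, 1]
ranking = filter_rank(features, labels)
print(ranking)
```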


Big Data: Math Terms (Cont.)

Machine Learning and Predicting

Reliability, Uncertainty and Global Sensitivity Analysis

Clustering and Classification

Validation Methods (Cross-Validation, Hold-out datasets, ...)

Graph Laplacian for clustering

Deterministic methods (NN, SVM, ...)

Probabilistic methods (Bayes classifier, PAM, ...)

Deep Learning (Hierarchical Classification)
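The cross-validation entry above can be sketched as a plain index splitter. This is an illustrative assumption about how the folds are formed (contiguous, unshuffled blocks); libraries such as scikit-learn offer shuffled and stratified variants.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), validation)
```

Each sample lands in exactly one validation fold, so every sample is used for validation once and for training k - 1 times.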


Big Data Search Algorithms

Cache-aware and cache-oblivious models

Use the CPU cache without knowing its size (whatever the type of machine), improving memory performance

Adapt to arbitrary memory hierarchies

Data clustering

Locality of memory references is increased.

Applications: matrix multiplication, sorting, matrix transposition
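Matrix transposition, listed above, is a standard cache-oblivious example: the matrix is split recursively along its longer side until the pieces are small, without any reference to the cache size. The sketch below shows only the recursion structure; the `base` cutoff is a practical assumption, and Python lists will not show the memory-locality benefit that a compiled implementation would.

```python
def transpose_into(src, dst, r0, r1, c0, c1, base=8):
    """Recursively write the transpose of src[r0:r1][c0:c1] into dst."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        # Split the longer dimension in half (here: rows).
        mid = r0 + rows // 2
        transpose_into(src, dst, r0, mid, c0, c1, base)
        transpose_into(src, dst, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose_into(src, dst, r0, r1, c0, mid, base)
        transpose_into(src, dst, r0, r1, mid, c1, base)

def transpose(matrix, base=8):
    """Return the transpose of a non-empty rectangular list of lists."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[None] * n_rows for _ in range(n_cols)]
    transpose_into(matrix, result, 0, n_rows, 0, n_cols, base)
    return result

matrix = [[r * 5 + c for c in range(5)] for r in range(3)]
print(transpose(matrix, base=2))
```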


Big Data Retrieval Algorithms

Streaming

Online Data Management

Adapt to arbitrary and unstructured Input Data

Real-Time Analytical Processing (RTAP)
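One way to picture streaming and online processing is a statistic that is updated per record without storing the stream. The sketch below uses Welford's online algorithm for mean and variance; the choice of this particular algorithm and the class name are assumptions made for illustration.

```python
class RunningStats:
    """Mean and sample variance over a stream, in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```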


Map-Reduce on Large Clusters

Motivation and Demand:

MapReduce programs tend to be very short, code-wise

They represent a data flow


Map-Reduce (Cont.)

Each step has one Map phase and one Reduce phase

Any algorithm has to be converted into the MapReduce pattern

Great solution for one-pass computations

Not very efficient for multi-pass computations and algorithms
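The Map and Reduce phases described above can be sketched on a single machine with the classic word count. This is a toy, in-process model under the assumption that a shuffle step groups values by key between the two phases; it does not use the Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data", "Big clusters big"])
print(counts)
```

A multi-pass algorithm would have to chain several such jobs, writing intermediate results out between them, which is the inefficiency the slide points to.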


Hadoop Framework

Features :

Open-source framework for processing large data

Works on clusters of cheap, unreliable machines

Widely used by companies that deal with big data applications

Compatible with Java, Python and Scala


Hadoop Framework (Cont.)

MapReduce Framework

Assigns work to different nodes

Hadoop Distributed File System (HDFS)

Primary storage system used by Hadoop applications.

Copies each piece of data and distributes the copies to individual nodes

Name Node (metadata) and Data Nodes (file blocks)

Redundant copies (each block is replicated three times by default)

Machines in a given cluster are cheap and unreliable

Decreases the risk of catastrophic failure

» Even in the event that numerous nodes fail

Links together the file systems on different nodes to make one integrated big file system (parallel processing)
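To see why three replicas reduce the risk of data loss, the toy model below places each block on three distinct nodes and checks whether every block is still readable after some nodes fail. The round-robin placement and the node names are assumptions for illustration; the real HDFS placement policy is rack-aware and more involved.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one live replica."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )

nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical data nodes
placement = place_blocks(4, nodes)
print(readable_after_failure(placement, {"n1", "n2"}))        # two nodes down
print(readable_after_failure(placement, {"n1", "n2", "n3"}))  # three nodes down
```

With three replicas per block, any two simultaneous node failures leave at least one copy of every block; losing all three nodes that hold the same block is what makes data unavailable.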



Hadoop V.2 : Hadoop NextGen MapReduce (YARN)



Hadoop Programming

Java

Full control of MapReduce; Cascading (open-source Java library)

Python , Scala, Ruby

Data Retrieval / Query Language

Hive

SQL-like language

Pig

Data flow language (simple, built out of small steps)

Scalding

Scala library built on top of Cascading (elegant model)


Big Data Programming

R, Java, Python and Scala (commonly used)

Three references (recommended reading):

https://www.linkit.nl/knowledge-base/177/4_most_used_languages_in_big_data_projects_Java

https://www.linkit.nl/knowledge-base/226/4_most_used_languages_in_big_data_projects_R

https://www.linkit.nl/eng/knowledge-base/196/4_most_used_languages_in_big_data_projects_Python



Apache Spark Framework

Spark Features (More than Distributed Processing)

Ease of use, and sophisticated analytics

In-memory data storage and near real-time processing

Holds intermediate results in memory

Stores as much data as possible in memory, then spills to disk

Spark vs Hadoop

Runs on top of existing HDFS

Data sets that are diverse in nature (text, videos, ...)

Variety in the sources of data (batch vs. real-time streaming data).

Claimed to be up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster when running on disk.
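The benefit of holding intermediate results in memory shows up in multi-pass workloads. The toy class below is not the Spark API; it is a hypothetical stand-in that counts how often the source is re-read with and without caching an intermediate dataset.

```python
class ToyDataset:
    """Toy stand-in for a distributed dataset (NOT the Spark API)."""

    def __init__(self, loader):
        self._loader = loader      # function that produces the records
        self._use_cache = False
        self._cached = None

    def cache(self):
        """Keep the computed records in memory after the first use."""
        self._use_cache = True
        return self

    def map(self, fn):
        # Lazy: the parent is only evaluated when collect() is called.
        return ToyDataset(lambda: (fn(x) for x in self.collect()))

    def collect(self):
        if not self._use_cache:
            return list(self._loader())
        if self._cached is None:
            self._cached = list(self._loader())
        return self._cached

def demo(use_cache):
    reads = {"count": 0}

    def read_from_disk():  # hypothetical expensive source
        reads["count"] += 1
        return [1, 2, 3, 4]

    squares = ToyDataset(read_from_disk).map(lambda x: x * x)
    if use_cache:
        squares.cache()
    # Three passes over the same intermediate result.
    totals = [sum(squares.map(lambda x, p=p: x * p).collect()) for p in (1, 2, 3)]
    return totals, reads["count"]

print(demo(use_cache=False))
print(demo(use_cache=True))
```

Both runs compute the same totals, but the uncached run goes back to the source on every pass, which mirrors the disk round trips between chained MapReduce jobs.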


Apache Spark Framework (Cont.)

Compatible with Java, Scala and Python

Perform Data Analytics and Machine Learning

SQL Queries, Streaming Data

Machine Learning and Graph Data Processing

Spark MLlib, Spark’s Machine Learning library

Spark can also work with data stored in a Cassandra database


Big Data and Cloud

Cloud Computing Platform & Services

(Cloudera, Hortonworks, MapR, Azure)


Big Data and NoSQL

Key-Value Stores

Each unique key maps to a pointer to a particular item of data.

Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB, Amazon SimpleDB, Riak

Column Family Stores

Very large amounts of data distributed over many machines.

Cassandra, HBase


Big Data and NoSQL (Cont.)

Document Databases

Similar to key-value stores, but the values are semi-structured documents stored in formats like JSON

Allow nested values associated with each key.

Support more efficient querying than plain key-value stores.

CouchDB, MongoDB



Graph Database

Flexible graph model

Used instead of tables of rows and columns and the rigid structure of SQL

Scale across multiple machines (Scale Out)

Neo4J, InfoGrid, Infinite Graph, Titan



JSON Format


RDBMS            NoSQL
Database         Database
Table, View      Collection
Row              Document (JSON, BSON)
Column           Field
Index            Index
Join             Embedded Document
Foreign Key      Reference
Partition        Shard

> db.user.findOne({age:39})
{
    "_id" : ObjectId("5114e0bd42…"),
    "first" : "John",
    "last" : "Doe",
    "age" : 39,
    "interests" : [
        "Reading",
        "Mountain Biking"
    ],
    "favorites" : {
        "color" : "Blue",
        "sport" : "Soccer"
    }
}
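The idea behind the query above can be mimicked in plain Python: a collection is a list of JSON documents, and a query is a set of field values that must match. The `find_one` helper and the second user record are hypothetical; this is a sketch of the concept, not the MongoDB driver.

```python
import json

# A tiny "collection" of documents; the second record is made up for the example.
USERS_JSON = """
[
  {"first": "John", "last": "Doe", "age": 39,
   "interests": ["Reading", "Mountain Biking"],
   "favorites": {"color": "Blue", "sport": "Soccer"}},
  {"first": "Jane", "last": "Roe", "age": 27,
   "interests": ["Chess"],
   "favorites": {"color": "Green", "sport": "Tennis"}}
]
"""

def find_one(collection, query):
    """Return the first document whose top-level fields match the query."""
    for document in collection:
        if all(document.get(field) == value for field, value in query.items()):
            return document
    return None

users = json.loads(USERS_JSON)
match = find_one(users, {"age": 39})
print(match["first"], match["favorites"]["sport"])
```

The nested `favorites` object travels with its parent document, which is the embedded-document alternative to a relational join.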



Machine Learning

Conventional Methods

Feature extraction and selection provide the input, and the proposed machine acts as the classifier

Sample ML Methods :

Support Vector Machines (SVM)

Naive Bayes Classifier

Artificial Neural Network (ANN)


Machine Learning (Cont.)

Support Vector Machine (SVM)

Kernel settings (linear, polynomial and Gaussian)

Works even when the number of features is large compared to the training sample size.

Less prone to overfitting than alternative choices.

Soft margin and hard margin.

Overfitting controlled by the soft margin (slack variables ε_i)

One-versus-all for multi-class problems; works well in practice (the classifier with the highest response wins)

K-fold cross-validation (validation data)
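The soft margin and the one-versus-all rule can be written out for a linear SVM. The sketch assumes the usual objective, 0.5·||w||² + C·Σ ε_i with ε_i = max(0, 1 − y_i(w·x_i + b)); the penalty `C`, the weights and the sample points are made-up values, and no training is performed here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, samples, labels, C=1.0):
    """Return (objective, slacks) for a linear soft-margin SVM."""
    # A slack is zero when the sample is on the correct side of the margin.
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(samples, labels)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

def predict_one_vs_all(models, x):
    """Pick the class whose binary classifier gives the highest response."""
    return max(models, key=lambda label: dot(models[label][0], x) + models[label][1])

# Hypothetical 2-D samples with labels in {+1, -1}.
samples = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, 1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, samples, labels)
print(objective, slacks)

# Hypothetical per-class (weights, bias) pairs, one binary classifier per class.
models = {"Basophil": ([1.0, 0.0], 0.0), "Lymphocyte": ([0.0, 1.0], 0.0)}
print(predict_one_vs_all(models, [0.2, 0.9]))
```

Only the third sample sits inside the margin, so it is the only one with a non-zero slack; a larger `C` penalizes such violations more and moves the model toward a hard margin.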


Machine Learning : Deep Learning

Supervised & Unsupervised approaches

Greedy layer-wise unsupervised pre-training.

Learns a hierarchy of features one level at a time.

Learns a new transformation at each level, composed with the previously learned transformations.

Seeks regularities to extract a unique representation.

Higher layers find representations more useful than the original input.

Accurate hierarchical representation of complex data

Subsequent feature extraction

Classification problems (types and classes)


Deep Learning (Cont.)

Earliest concepts of deep learning :

Perceptron neural network structures.

A neural network can technically have more than one hidden layer.

Increasing the number of hidden layers leads to:

» Vanishing gradients, overfitting.
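A back-of-the-envelope sketch of the vanishing gradient: with sigmoid activations, each layer multiplies the back-propagated gradient by the sigmoid derivative, which is at most 0.25. The simplifying assumptions here (one unit per layer, the same pre-activation `z` and weight in every layer) are for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_factor(depth, z=0.0, weight=1.0):
    """Gradient scaling after passing back through `depth` sigmoid layers."""
    derivative = sigmoid(z) * (1.0 - sigmoid(z))  # peaks at 0.25 when z == 0
    return (weight * derivative) ** depth

for depth in (1, 5, 10):
    print(depth, backprop_factor(depth))
```

Even in this best case for the sigmoid (z = 0), ten layers shrink the gradient by roughly six orders of magnitude, so the earliest layers barely learn.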


Deep Learning (Cont.)

Auto-encoders, Stacked Auto-encoders, Restricted Boltzmann Machines (RBM), the spike-and-slab Restricted Boltzmann Machine, Deep Belief Networks, Convolutional Networks


Deep Learning: Convolutional Neural Network

Extracts topologically invariant properties from the gray-scale image through spatially local connections (receptive fields)

Especially suited to input that is spatially or temporally distributed

A CNN is composed of two distinct parts:

In the first part, several layers are convolved and then down-sampled (max pooling).

The second part categorizes the pattern into classes (such as RBF).

A CNN consists of three different kinds of layers:

convolution layers (with different feature maps), sub-sampling (max-pooling) layers and an ensemble of fully connected layers
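The first two layer types can be sketched in a few lines: a convolution slides a small kernel over the image to build a feature map, and max pooling down-samples that map. The kernel, the tiny image and the function names are assumptions for illustration; as in most CNN libraries, the "convolution" here is computed without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k_rows) for b in range(k_cols))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]

def max_pool(feature_map, size=2):
    """Down-sample by taking the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical 5x5 gray-scale image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

feature_map = conv2d_valid(image, edge_kernel)
pooled = max_pool(feature_map)
print(feature_map)
print(pooled)
```

The pooled map still reports the edge in its left half even though the exact column is lost, which is the kind of tolerance to small shifts that pooling is meant to provide.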


Convolutional Neural Network (Cont.)


CNN: recognition rate after 105 epochs; few samples (28 per class); similarity between Basophil and Lymphocyte

Deep Learning in Code

Reference :

www.deeplearning.net/software_links

Programming Language :

Python, Matlab, Java, Lua

Machine Learning in Python

Scikit-learn, Keras, Caffe, ...

Pylearn2

Machine Learning in Matlab

Torch7 (Lua-based, Matlab-like environment)

Machine Learning in Java

Deeplearning4j



Big Data in the real world

Climate data, Large scale health care

Complex Image Processing

Personalization (Facebook, Telegram, ...)

Advertising, Mobile Telecommunication Networks (e.g., 5G),

E-commerce and E-Banking Applications


Big Data in the real world (Cont.)

Deep Learning Algorithm Transcribes House Numbers (Google)



Car Classification using Deep Learning Approach



Banking Systems; Big Data and Deep Learning

Banknote Authentication and Forgery Detection

Financial Fraud Detection

Bank Embezzlement & Money Laundering

Boost e-commerce Sales

Losses from Disgruntled Customers

Loan Approval Prediction


Contact Info

Mehdi (Nima) Habibzadeh Motlagh

PhD in Computer Science (Concordia University, Sept 2015)

Email : [email protected]

Cell phone : +98 912 326 7046

Telegram : +1 514 632 2838

