CPEE Big Data Analytics and Optimization Curriculum

Date post: 14-Dec-2014
Upload: kaushik-madaka

Big Data Analytics and Optimization Certificate Program in Engineering Excellence

Certificate in Accelerated Engineering

M.Tech. (GITAM University) – Applied Computer Science and Technology

INTERNATIONAL SCHOOL OF ENGINEERING http://www.insofe.edu.in

LIST OF COURSES

ESSENTIALS OF APPLIED PREDICTIVE ANALYTICS

STATISTICAL MODELING FOR PREDICTIVE ANALYTICS IN ENGINEERING AND BUSINESS

EFFECTIVE DECISION MAKING: OPTIMIZATION, SIMULATION AND STATISTICAL METHODS

ENGINEERING BIG DATA WITH R AND HADOOP ECOSYSTEM

TEXT MINING AND SOCIAL MEDIA ANALYTICS

METHODS AND ALGORITHMS IN MACHINE LEARNING

ADVANCED TOPICS IN MACHINE LEARNING

ARCHITECTING DATA ANALYTICS SOLUTIONS IN THE REAL WORLD

CSE 7301c

Essentials of Applied Predictive Analytics

This five-day module teaches the complete data analytics lifecycle in an applied and

hands-on manner. A data-rich business environment is detailed and a few semi-real

world problems that can be solved in 5 days are worked on. It starts with playing with

data, using data visualization as an analytics technique and data pre-processing. It then

smoothly moves to designing and implementing predictive models for a variety of

business applications. It also covers important aspects of analyzing the quality of the

model. Finally, the latest trends in reporting the results are discussed.

While one or two business cases are used as anchoring themes during the program, the

general applicability is emphasized throughout.

At the end of the program, the participants are able to answer business questions such

as “who is likely to buy a new product amongst the existing customers”, “which

customers are most likely to default on a loan or an insurance payment” and “if a

customer buys Product A, which other products can be recommended to him/her”.

This course thoroughly trains candidates on the following techniques:

A framework for solving Analytics problems.

Pre-processing Techniques: Graphical visualization; Handling missing values;

Data standardization

Introduction to two important data mining techniques: Decision Trees and

Association Rules

A thorough introduction to solving analytics problems using R

Model selection using K-fold validation

Day 1

Introduction: Big picture of Data Sciences

Understanding the “business case” and defining a solution framework

Getting the data into R environment: Reading data as a Data frame, Matrix,

Vector and a List; Visualization: Various plots and their purpose (Scatter, Bar,

Pie, Box, Histograms, and Surface and Contour graphs)

Pre-processing the data: Binning; Normalizing; Imputation; Removing noise and

outliers
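
The pre-processing steps named in the Day 1 list (imputation, normalizing, binning) can be illustrated with a minimal sketch. The course itself uses R; Python is used here only for illustration, and the function names and the `ages` column are hypothetical, not taken from the course material.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, None, 40, 60]  # hypothetical column with one missing value
filled = impute_mean(ages)
scaled = min_max_normalize(filled)
bins = equal_width_bins(filled, 2)
```

Mean imputation is only one of several options; median or model-based imputation may suit skewed data better.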

Day 2

Data Pre-processing - continued

Traps and Errors: Confusion Matrix, Analyzing False Positives and False

Negatives from a problem perspective, Different error measures used in

forecasting

Model selection: K-fold validation

Introduction to Decision Trees and their structure
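
The confusion matrix and K-fold validation items from the Day 2 list can be sketched as follows. This is an illustrative Python sketch rather than the course's R material; `k_fold_indices` and `confusion_counts` are hypothetical helper names, and the folds are taken in order without shuffling for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


folds = list(k_fold_indices(5, 2))
counts = confusion_counts([1, 0, 1, 0], [1, 1, 0, 0])
```

In practice the indices are usually shuffled before splitting, and the model's error is averaged over the K test folds.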

Day 3

Construction of Decision Trees through simplified examples; Choosing the "best"

attribute at each non-leaf node; Entropy; Information Gain

Generalizing Decision Trees; Information Content and Gain Ratio; Dealing with

numerical variables; Other measures of randomness

Issues in Inductive learning: Curse of Dimensionality, Overfitting, Bias-Variance

tradeoff

Pruning a Decision Tree; Cost as a consideration; Unwrapping Trees as rules
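
Choosing the "best" attribute by Entropy and Information Gain, as listed for Day 3, can be sketched in a few lines. The toy weather rows and the attribute names (`outlook`, `windy`, `play`) are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
```

A tree builder would apply this choice recursively at each non-leaf node; Gain Ratio additionally divides the gain by the entropy of the split itself.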

Day 4

A mathematical model for association analysis; Large itemsets; Association Rules

Apriori: Constructs large itemsets with minsup by iterations

Interestingness of discovered Association Rules; Examples; Association Analysis

vs. Classification

Using Association Rules to compare stores; Dissociation Rules; Sequential

Analysis Using Association Rules
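
The iterative construction of large itemsets with a minimum support (Apriori), listed for Day 4, can be sketched as below. This is a simplified illustration, assuming support is given as an absolute count; the basket contents are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets appearing in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
large_itemsets = apriori(baskets, min_support=2)
```

Association rules are then derived from each large itemset by checking the confidence of every split into antecedent and consequent.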

Day 5

The first four days cover enough techniques and process to handle a complex analytics

problem. On the last day, all of it is brought together into a coherent story.

Data visualization and Story-telling: Anatomy of a graph

Animated graphs, BI dashboards and the latest trends in data visualization

Industry exposure: A webinar by an industry expert about how they are using

analytics in the real world

CSE 7302c

Statistical Modeling for Predictive Analytics in Engineering and Business

This six-day module is aimed at teaching “how to think like a statistician”. “Statistical

thinking will one day be as necessary for efficient citizenship as the ability to read and

write”, wrote H. G. Wells in the year 1895. That day and age has arrived with Data

Analytics going mainstream (For Today’s Graduate, Just One Word: Statistics -

http://www.nytimes.com/2009/08/06/technology/06stats.html). This course teaches this

very important and essential skill. Broadly, the following aspects are covered:

Studying the data systematically and gaining intuition about variables and their

inter-relationships

Applied statistical methods to extract hidden relations and patterns from the data

By the end of the course, the participants will be able to answer questions like “what will

be the price of a commodity at a future point of time”, “if a sample of 100 components

has a dimension of 100 nanometers, what can I say about the dimension of the

population of 100,000 components”, etc. Data sets from Retail, Finance, Manufacturing

and Healthcare industries are used to explain the concepts.

This course thoroughly trains candidates on the following techniques:

Probability distribution analysis, Correlations and ChiSquare testing

Linear regression, Multiple linear regression and Logistic regression

Clustering

Time series analysis

Non-parametric statistics

From a tools perspective, you will gain confidence with tools like R and Excel for creating

meaningful and information rich dashboards.

Day 1

Computing the properties of an attribute: Central tendency and dispersion (Mean, Median,

Mode, Range, Variance, Standard Deviation); Expectations of a Variable; Moment

Generating Functions

Describing an attribute: Probability distributions (Discrete and Continuous) -

Bernoulli, Binomial, Multinomial and Poisson distributions

Describing the relationship between attributes: Covariance; Correlation;

ChiSquare

Day 2

Describing a single variable continued: Weibull, Geometric, Negative Binomial,

Gamma and Exponential distributions; Special emphasis on Normal distribution;

Central Limit Theorem

Inferential statistics: How to learn about the population from a sample and vice

versa; Sampling distributions; Confidence Intervals, Hypothesis Testing
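
Learning about a population from a sample, as in the "100 components" question above, is typically done with a confidence interval. The sketch below uses the normal approximation, which is reasonable for large samples; for small samples a t-distribution would be the usual choice. The measurement values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error


# Hypothetical sample: 100 component dimensions in nanometers.
measurements = [99, 101] * 50
low, high = mean_confidence_interval(measurements)
```

The interval narrows as the sample size grows, since the standard error shrinks with the square root of `n`.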

Day 3

Multivariate normal distributions

Types of clusters; Different clustering methods; K-Means; K-Medoids

Iterative distance-based clustering; Dealing with discrete values in K-Means

Constructing a hierarchical clustering using K-Means
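
The iterative distance-based clustering named in the Day 3 list can be sketched as a plain K-Means loop (Lloyd's algorithm). The starting centroids are passed in explicitly to keep the example deterministic; the points and the function name are hypothetical.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # New centroid is the coordinate-wise mean of the cluster.
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (8, 8), (9, 8)]  # hypothetical 2-D data
centers, groups = k_means(points, centroids=[(0, 0), (10, 10)])
```

K-Medoids differs in that each cluster center must be one of the data points, which makes it less sensitive to outliers.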

Day 4

Regression (Linear, Multivariate Regression) in forecasting

Analyzing and interpreting regression results

Logistic Regression
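
For the linear regression item in the Day 4 list, fitting a single-variable model by ordinary least squares and interpreting it through R-squared can be sketched as follows. The data points are hypothetical and chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]  # hypothetical data on the line y = 1 + 2x
slope, intercept = fit_line(xs, ys)
fit_quality = r_squared(xs, ys, slope, intercept)
```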

Day 5

Trend analysis and Time Series

Cyclical and Seasonal analysis; Box-Jenkins method

Smoothing; Moving averages; Auto-correlation; ARIMA; Holt-Winters method;

GARCH

VaR; Applications of Time Series in financial markets
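
Two of the smoothing techniques named in the Day 5 list, moving averages and simple exponential smoothing, can be sketched as below. This covers only the simplest forms; Holt-Winters extends exponential smoothing with trend and seasonal components. The series values are hypothetical.

```python
def moving_average(series, window):
    """Simple moving average over a sliding window of fixed length."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


sales = [1, 2, 3, 4, 5]  # hypothetical series
trend = moving_average(sales, window=3)
smooth = exponential_smoothing([10, 20, 30], alpha=0.5)
```

A larger `alpha` makes the smoothed series react faster to recent observations; a smaller one gives more weight to history.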

Day 6

Non-parametric statistics

ANOVA

Survival analysis in equipment operations

Industry exposure: A webinar by an industry expert about how they use

statistical data analysis in the real world

CSE 7303c

Effective Decision Making: Optimization, Simulation and Statistical

Methods

This module is designed to enhance your decision capabilities when confronted with

strategic choices. You learn techniques of turning real-world problems into mathematical

models. It teaches three classes of models: Optimization, Simulation and Statistical.

The application areas originate from problems in finance, marketing and operations.

At the end of the program, you will be able to answer questions like “should I outsource

a service or do it in-house”, “how to optimize a supply chain”, and “how to price a

product when faced with demand uncertainty”.

This course thoroughly trains students in the following techniques:

Multi-criteria decision analysis

Linear, Integer, Binary and Quadratic programming

Data envelopment analysis; Goal and multi-objective modeling

Genetic Algorithms

Simulations in decision analysis: Monte Carlo and Markov Chain methods

Game theory and strategy

From a tools perspective, this course trains you to build your own R code, and you are

provided with R code for a host of the problems mentioned above. The course is anchored on a

large financial and mutual fund company, and techniques for solving a variety of

problems it faces are provided.

Day 1

Introduction to the business problem

Multi-criteria decision making for the CEO: Scientific decision making, Value of

information

Analytic hierarchy process

Strategy and game theory in analytics and decision analysis

Day 2

A system for advising the clients on the right investment - A COO’s problem - Linear

programming: Applications, Graphical analysis, Sensitivity and Duality analyses

Worked-out examples in helping customers identify the right portfolio, planning cash

transport and employee assignment

Comparing the performance of various offices: A CMO’s problem and the data

envelopment analysis

Setting up a new office in a different city: Goal programming and Multi-objective

programming
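
The graphical analysis of a two-variable linear program rests on the fact that an optimum, when one exists, lies at a corner of the feasible region. A minimal sketch that enumerates those corners is shown below. It assumes a bounded, non-empty feasible region, and the objective and constraints are hypothetical rather than taken from the course's case study.

```python
from itertools import combinations


def solve_lp_2d(objective, constraints, eps=1e-9):
    """Maximize cx*x + cy*y subject to a*x + b*y <= c for each (a, b, c).

    Returns (best_value, x, y), or None if no feasible corner is found.
    """
    cx, cy = objective
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundary lines never intersect
        # Intersection of the two boundary lines (Cramer's rule).
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + eps for a, b, c in constraints):
            value = cx * x + cy * y
            if best is None or value > best[0]:
                best = (value, x, y)
    return best


# Hypothetical problem: maximize 3x + 2y
# subject to x + y <= 4, x + 3y <= 6, x <= 3, x >= 0, y >= 0.
constraints = [(1, 1, 4), (1, 3, 6), (1, 0, 3), (-1, 0, 0), (0, -1, 0)]
optimum = solve_lp_2d((3, 2), constraints)
```

Corner enumeration grows quickly with the number of constraints, which is why real solvers use methods such as simplex instead.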

Day 3

Goal programming and Multi-objective programming - Continued

Minimizing the risk - A CRO’s problem - Graphical representation of Maxima,

Minima, Point of inflection and Saddle points in single and multivariable functions

Derivative, Gradient and Hessian; Optimization with constraints; Lagrange

multipliers

Quadratic programming formulation and applications in portfolio analytics

Day 4

Minimizing travel costs: Solving non-convex problems

Monte Carlo essentials and making quick estimates

Markov Chains and generating samples from complex scenarios

Metropolis-Hastings algorithms; Simulated Annealing; Minimizing travel distance

of the mutual funds

Genetic Algorithms: The algorithm and the process
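
The idea behind "Monte Carlo essentials and making quick estimates" can be illustrated with the classic estimate of pi from random points in the unit square. This is a generic textbook example, not one of the course's business cases; the function name and seed are arbitrary choices.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi as 4 times the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * inside / n_samples


approx_pi = estimate_pi(100_000)
```

The error of such an estimate shrinks roughly with the square root of the number of samples, so each extra digit of accuracy costs about 100 times more samples.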

Day 5

Representing data for a Genetic Algorithm

Why and how do Genetic Algorithms work?

Industry exposure: A webinar by an industry expert about how they are using

analytics in the real world

CSE 7304c

Engineering Big Data with R and Hadoop Ecosystem

Companies collect and store large amounts of data during daily transactions. This data

is both structured and unstructured. The volume of the data being collected has grown

from MB to TB in the past few years and is continuing to grow at an exponential pace.

The very large size, lack of structure and the pace at which it is growing characterize

“Big Data”.

To analyze long-term trends and patterns in the data and provide actionable intelligence

to managers, this data needs to be consolidated and processed using specialized techniques;

those techniques form the core of the module.

The use cases for the program are "analyzing a customer in near real-time" as applied in

Retail, Banking, Airlines, Telecom or Gaming industries. At the end of the program, the

participants will be able to set up a Hadoop cluster and write a Map Reduce program that

uses pre-built libraries to solve typical CRM data mining tasks like recommendation

engines.

This course thoroughly trains candidates on the following techniques:

HiveQL (HQL) querying and Pig Latin scripting (with a focus on statistical analysis)

Hadoop and Map Reduce methods of programming

Columnar (NoSQL) databases

From a tools perspective, this course introduces you to Hadoop. You will learn one of

the most powerful combinations of Big Data, viz., “R and Hadoop”.

In addition, all the essential content required to build powerful Big Data processing

applications and to acquire Hadoop certifications will be covered in the course. The

emphasis is not on abstract theory or on mindless coding. The concepts and the real-

world programming techniques are emphasized.

Day 1

Big Data – an Introduction

Parallel and Distributed Computing

Hadoop: An overview

Installing and starting to play with Hadoop

Day 2

On this day, the course gives an exciting motivation for learning Big Data. Common and

special algorithms are taught in the context of a specific business problem, and the

Hadoop Ecosystem is introduced.

Linux and Java refresher

Algorithms for real-world problems well-suited to Hadoop - Standard algorithms:

Sorting, Searching, Indexing, Concurrent Algorithms

Hadoop usage in the real world

HDFS Architecture

Hadoop Ecosystem I: HBase, Hive, Pig, Chukwa, Avro, Flume and ZooKeeper

Demo: Data analysis using Hive and Pig

Day 3

During the main part of the course, you will learn the fundamental concepts of Map

Reduce with detailed explanation.

Introduction to Map Reduce

Programming methodologies and paradigms in Map Reduce

Understanding the concepts of Graph Algorithms and Page Rank

Beyond basics: The flow; APIs; Driver; Mapper; Reducer

Demo: Compiling and running basic Java Map Reduce code, Hadoop configuration

parameters & logs.
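
The Mapper, shuffle and Reducer flow described for Day 3 can be illustrated without a cluster. The sketch below simulates the three phases in a single Python process for a word count; it is not Hadoop API code (the course demos use Java), and the function names are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Driver: wire the map, shuffle and reduce phases together."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


word_counts = run_job(["big data", "Big Hadoop data"])
```

On a real cluster, the framework runs many mappers and reducers in parallel and performs the shuffle across machines; only the mapper and reducer logic is written by the programmer.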

Day 4

On this day, you will learn the practical aspects of working with Map Reduce.

Map-side and Reduce-side Joins; Secondary Sort

Page Rank in Map Reduce

Practical Aspects of Map Reduce Implementation, Streaming

Demo: Hadoop streaming, More realistic Map-Reduce code walk-through and

execution.

Hadoop Ecosystem II: Sqoop, Mahout, Whirr, Hama and Oozie

Demos on Hadoop Ecosystem: Sqoop, Mahout

R-Hadoop: An overview

Demo: R-Hadoop: “RHDFS”

Day 5

The last day covers Hadoop certification aspects and hands-on assignments.

Overview of Hadoop certifications

Hands-on in-class assignment where students can use their choice of

Map Reduce/Hive/Pig/RHadoop/Streaming to code a new problem.

CSE 7206c

Text Mining and Social Media Analytics

This module teaches two of the most important applications of analytics in high tech

industries.

Text mining: Unstructured data comprises more than 80% of the stored

business information (primarily as text). This helped text mining emerge as a

leading-edge technology. This module describes practical techniques for text

mining, including pre-processing (tokenization, part-of-speech tagging),

document clustering and classification, information retrieval, search and

sentiment extraction in a business context.

Predictive modeling with social network data: Social network mining is

extremely useful in targeted marketing, on-line advertising and fraud

detection. The course teaches how incorporating social media analysis can help

improve the performance of predictive models.

By the end of the course, you will be able to answer questions like “how to classify or tag

a document into a category”, “how to rank some people in a network as more likely

customers than others”, etc.

In terms of techniques, the course teaches:

Text pre-processing

Bag-of-words and Text Similarity measures

Page Rank; Neighbor analysis in predictive modeling

This course uses packages like R, WEKA and R-Hadoop for demonstrating real world

examples.

Day 1

Unstructured vs. semi-structured data; Fundamentals of information retrieval

Properties of words; Vector space models; Creating Term-Document (TxD)

matrices; Similarity measures

Low-level processes (Sentence Splitting; Tokenization; Part-of-Speech Tagging;

Stemming; Chunking)
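
Creating a Term-Document matrix and comparing documents with a similarity measure, as listed for Day 1, can be sketched as follows. Tokenization here is a plain whitespace split, a deliberate simplification of the low-level processes named above; the documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = ["data mining text", "text mining", "hadoop cluster"]  # hypothetical corpus
vocab, matrix = term_document_matrix(docs)
doc_vectors = list(zip(*matrix))  # one column (vector) per document
similar = cosine_similarity(doc_vectors[0], doc_vectors[1])
unrelated = cosine_similarity(doc_vectors[0], doc_vectors[2])
```

Raw counts are often replaced by TF-IDF weights so that very common terms contribute less to the similarity.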

Day 2

Text classification and feature selection: How to use Naïve Bayes classifier for

text classification

Evaluating the accuracy of text mining systems

Sentiment Analysis
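
Using a Naïve Bayes classifier for text classification, applied here to a toy sentiment task, can be sketched as below. The sketch assumes a multinomial model with add-one (Laplace) smoothing and whitespace tokenization; the training sentences and labels are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [  # hypothetical labeled reviews
    ("great product love it", "pos"),
    ("love this great service", "pos"),
    ("terrible product hate it", "neg"),
    ("hate this awful service", "neg"),
]
model = train_naive_bayes(training)
label = classify("love great", model)
```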

Day 3

Fundamentals of web search

A detailed analysis of Page Rank

Page Rank in social network analysis

Analyzing social networks for targeted marketing and fraud detection
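
The Page Rank computation discussed on Day 3 can be sketched as a power iteration over a small link graph. The damping factor of 0.85 is the commonly quoted default, the fixed iteration count is a simplification (real implementations usually test for convergence), and the three-node graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Page Rank; links maps each node to the nodes it points to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

The same iteration applies to a social network, where an edge might represent "follows" or "transacts with" instead of a hyperlink.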

Day 4

Natural Language Analysis

Discussion of text mining tools and applications

Industry exposure: A webinar by an industry expert about how they are using

analytics in the real world

CSE 7305c

Methods and Algorithms in Machine Learning

This module discusses the principles and ideas underlying the current practice of data

mining and introduces a powerful set of useful data analytics tools (such as K-Nearest

Neighbors, Neural Networks, etc.). Real-world business problems are used for practice.

In addition, for each of the techniques, both the traditional approach and the Big Data

approach are taught.

At the end of the course, the student will be able to answer questions like “which

technique is likely to work under what situations”, “how to handle fraud detection” and

“how to recognize handwriting”.

From a techniques perspective, the student learns:

Bayesian analysis and Naïve Bayes classifier

Neural Networks

K-Nearest Neighbors

Association Rules, Dimensionality reduction using Principal Component Analysis

(PCA), Singular Value Decomposition (SVD)

Ensemble and Hybrid methods

A fictitious courier company is taken as an example and issues faced in this industry are

solved.

Day 1

Business problem and solution architecture

Motivation for Neural Networks and their applications

Perceptron and Single Layer Neural Network, and hand calculations

Learning in a Neural Net: Back propagation and conjugate gradient techniques

Application of Neural Net in Face and Digit Recognition
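
The Perceptron named in the Day 1 list is small enough to follow by hand, which is presumably the point of the "hand calculations". A minimal sketch of the classic perceptron learning rule on the logical AND function is given below; the step activation, zero initial weights and learning rate of 1 are assumptions chosen for simplicity.

```python
def predict(features, weights, bias):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron rule: nudge the weights by the error on every sample."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in samples:
            error = target - predict(features, weights, bias)
            bias += learning_rate * error
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
    return weights, bias


# Logical AND is linearly separable, so a single perceptron can learn it.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
outputs = [predict(x, weights, bias) for x, _ in and_gate]
```

A single perceptron cannot learn functions that are not linearly separable, such as XOR, which motivates multi-layer networks trained with back propagation.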

Day 2

Self Organizing Maps (SOM)

Computational geometry; Voronoi diagrams

K-Nearest Neighbor method

Wilson editing and triangulations

K-nearest neighbors in collaborative filtering, digit recognition
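
The K-Nearest Neighbor method from the Day 2 list can be sketched as a majority vote among the closest training points. Euclidean distance and `k=3` are assumptions for the example; the labeled points are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points forming two well-separated groups.
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
label = knn_predict(train, (2, 2))
```

Sorting the whole training set for each query is fine for small data; larger data sets usually rely on spatial index structures to find neighbors faster.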

Day 3

Representing data in a matrix form; Bases and thinking of attributes as bases;

Orthogonality and Orthonormality; Linear independence of axes

Transformation matrices and Eigenvectors as transformation matrices

Principal Component Analysis (PCA)

Singular Value Decomposition (SVD) and applications in Association Rules and

Latent Semantic Indexing (LSI)

Day 4

Probability fundamentals

Bayes Theorem and its applications

Becoming instinctively Bayesian
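
A short worked example of Bayes Theorem shows why "becoming instinctively Bayesian" matters for problems such as fraud detection. The rates below (1% fraud, 90% detection, 5% false alarms) are hypothetical numbers chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(hypothesis | positive evidence) by Bayes' theorem.

    prior:               P(H), e.g. the share of fraudulent transactions
    likelihood:          P(E | H), e.g. the chance that fraud raises an alert
    false_positive_rate: P(E | not H), e.g. the chance a genuine one raises an alert
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical rates: 1% of transactions are fraudulent, the detector flags
# 90% of frauds and 5% of genuine transactions.
p_fraud_given_alert = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
```

With these numbers only about 15% of alerts are true frauds, because genuine transactions vastly outnumber fraudulent ones.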

Day 5

Naïve Bayes classifier

Ensemble and Hybrid models

AdaBoost and Random Forests

Industry exposure: A webinar by an industry expert about how they are using

analytics in the real world

CSE 7108c

Advanced Topics in Machine Learning

This module discusses the most advanced data mining techniques such as Support

Vector Machines (SVM), Bayesian Belief Nets, Expectation Maximization and

Reinforcement Learning. This is suited for those interested in getting into an R&D lab of

a product company or a PhD program in machine learning.

Day 1

Linear learning machines and Kernel methods in learning

VC (Vapnik-Chervonenkis) dimension; Shattering power of models

Algorithm of Support Vector Machines (SVM)

Day 2

Bayesian Belief Nets

Expectation Maximization

Day 3

Reinforcement Learning and Adaptive Control

Applications of machine learning to robotic control, data mining, autonomous

navigation, bioinformatics and speech recognition

R&D exposure: A webinar by a senior scientist about the cutting-edge

developments in analytics

CSE 7107c

Architecting Data Analytics Solutions in the Real World

OK! The rubber meets the road! It is competition and fun time. You will actually

architect an entire solution (actually 2!). This module also helps bring all the concepts

learnt in other modules into perspective, helping students provide end-to-end solutions

to business problems.

Students are divided into groups of approximately 4 each. They are given a real world

problem with insufficient information. They are required to conduct interviews, obtain

the information, design a solution, and come up with an implementation plan.

Days 1, 2 and 3

The students get the problem a day prior to the start of this module. Each team works

through the problem, and comes up with a solution architecture and effort estimates. In

addition, there are at least two presentations by industry experts from consulting,

insurance, retail, services and/or financial industries.

