Page 1

Demand for deep analytical talent in the United States could be 50 to 60 percent greater than its projected supply by 2018

Source: The McKinsey Global Institute

April 2018, BBDS Inc.

Data scientists are like vehicle restoration engineers who investigate, plan and build a car!

To these Data Scientists, "Data" is analogous to scattered pieces of metal, some clean and some hidden in the dirt. If all the pieces are identified, collected, and assembled correctly, they will resemble the skeletal framework of the car the engineers have to construct or restore. Some of these pieces fit perfectly, but some may be broken in a way that they cannot be welded or glued back together, and some pieces may even be missing. However, if the engineers put these pieces together with care, due diligence, and creativity, they can restore the car to something close to its glory days.

In Data Science, a variety of measurements are used to reconstruct and represent the phenomenon of interest as accurately as possible so that we can control and improve it. The Data Infrastructure and Tools are like the automotive parts (metal and plastic) that fit and run throughout the skeletal framework once the pieces are collected, refined, and arranged (though not perfectly at this point). These parts collectively provide agility and rigidity to the skeletal frame, enabling it to do interesting things. They can make the "boring data" come to life and express itself in many interesting ways, though not yet in a useful or meaningful way. The Analytics Algorithms collectively mimic the function of an engine. A relatively small mass of the whole thing, the engine is nevertheless the entity through which all the moving parts (including the car itself) are controlled to perform interesting maneuvers and tasks, which may be unique to that particular car. In the end, what was initially a scattered pile of parts (metal and plastic) now moves around in perfect harmony, doing interesting things.

That is what Data Scientists and Analytics Experts try to do…

Give life to the often “boring data”, similar to the engineers giving life to scattered car parts!

Data Science End to End
Big Bang Data Science Solutions

13 Week DS Program, 5th Batch

Page 2

Table of Contents

1. Introduction

2. Course Overview

3. Course Information

4. What You Get in Return?

5. Your Portfolio

6. Course Curriculum

7. Course Fee

8. Instructor

9. About BBDS

www.bigbang-datascience.com


Advanced Analytics market to grow from $7.04 billion in 2014 to $29.53 billion in 2019

Source: Markets and Markets

Global BI and Analytics market would grow to $22.8 billion by 2020

Source: Gartner

Page 3

BBDS 13 Week Program

5th Batch

Why Data Science?

1. Introduction

Congratulations! You are interested in learning everything necessary to become a Data Scientist and join the ranks of one of the fastest-growing jobs in IT.

"Data science" wasn't even a term 10 years ago. And now everyone is asking the same question. What's the deal? We are here to help you sort through the data.

Data science is the big new thing. Companies in every industry are recognizing the strategic importance of using data to stay competitive.

Organizations across the globe are desperate to hire data scientists. A shortage of data scientists and business analysts means the employment outlook for professionals with the required knowledge and technical skills in these areas is extremely positive.

Harvard Business Review’s “Data Scientist: The Sexiest Job of the 21st Century,” notes that “the shortage of data scientists is becoming a serious constraint in some sectors.”

2. Rewarding Career!

The BBDS 13 Week course assumes that students know close to nothing about Data Science and Machine Learning. Its goal is to give you the concepts and intuitions you need to actually implement programs capable of learning from data.

We will cover a large number of techniques, from the simplest and most commonly used (such as Linear Regression) to some of the Deep Learning techniques that regularly win competitions.

Working models to showcase your skills

Data Science Methodology

Skills matrix for best-fit roles

Extensive and Flexible Live Online Training

Instructor-Led Course

Training Video Recordings

Quality Training Materials

Two-Way Interactive Sessions

Graded Assignments & Professional Certificate

Mock Exam/Assessment

Career Path to learn, grow and prosper

Repeat anytime at no cost

Possible apprenticeship (2-3 projects)

Post-training support in career search

Resume & Interview Prep

Job Oriented Training

Job Placement and Placement Guidance

In non-digitally native organizations, these competencies will take time to evolve and mature. An organization should first commit to the importance of making analytics-driven decisions. This will require a clear technical roadmap and gradual alignment in culture and organizational structure.

With these in place, the organization should focus on four key pillars to have a successful journey toward analytics maturity.



Page 4

3. Course Overview

By 2020, 60% of information delivered to decision makers will be considered by them always actionable, doubling the rate from the current (2015) level.
Source: Cloudera

In today’s digital age, a successful organization has to achieve data and analytics competencies to remain competitive and relevant to the marketplace; i.e. it must be data and analytics driven. By leveraging the detailed relevant data at its disposal and applying analytics over it, an organization can potentially capture new market share, better serve its existing customers, and/or save millions of dollars annually.

This pyramid illustrates the succession of steps to go from Raw Data to Intelligent Decisions and how to achieve such analytics maturity. Organizations can optimize their decisions based on the specific analyses performed on the data, which depend on the specific business use case.

The McKinsey Global Institute agrees. “Demand for deep analytical talent in the United States could be 50 to 60 percent greater than its projected supply by 2018.” The result may be a shortage of “140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts to analyze big data and make decisions based on their findings.” Data science careers are growing in virtually every sector: manufacturing, construction, transportation, warehousing, communication, science, health care, computer science, information technology, retail, sales, marketing, finance, insurance, education, government, security, law enforcement, and more.

The future is bright for the budding data scientist, but the path to arrival is not so clear. The advent of open data, open source computing, and inexpensive processing power is clearing the way for participation in the data revolution. But these elements need to be combined in an intelligent fashion. People from all walks of life can become data scientists; the requisites are hard work, talent, a desire to discover, and the right skills and tools.

Big Bang Data Science Solutions now offers a complete, intensive 13 Week Data Science Program that covers topics including Data Mining, Data Analytics, Applied Machine Learning, Predictive Analytics, Statistical Analysis, Regression Analysis, Classification Analysis, Clustering Analysis, and Data Visualization.

Total Session Hours: 140+
Total Coding Hours: 80+
Effort Needed: 10-15 hrs/week
Projects & Case Studies: 10+
Prospective Careers: Data Scientist, AI/ML Engineer
Course Duration: 12 weeks + 1 week for resume & interview preparation
Session Duration: 3 hrs (1 hr 45 min concept learning, 45 min concept implementation)

Collaborate with mentors on coding assignments and projects in the last 45+ minutes of every session


Fig. 1: From Raw Data To Intelligent Decisions. Pyramid levels, bottom to top: Data Infrastructure; Data Types (Structured, Semi-structured, Unstructured); Business Intelligence; Advanced Analytics; Decisions.



Page 5


4. Course Information

Course Dates: April 21st, 2018 to July 14th, 2018

Meeting Times: Sundays 3:00 pm - 6:00 pm EST, Mondays 8:00 pm - 11:00 pm EST, Fridays 8:00 pm - 11:00 pm EST
- OR -
Watch recorded sessions anytime
Available for extra time if needed

Location: ATC Innovation Center, 2972 Webb Bridge Road, Alpharetta, GA
- OR -
Join remotely

This phase is concerned with the identification of all relevant data (ideally raw, detailed data that has already been validated) for the already formulated business problem: accessing it, assessing its quality, and transforming it into a usable form. Relevant data is often in detailed raw form with preliminary data quality checks already performed on it. However, in its current form, it is often not suitable for direct feeding into data mining algorithms. One often has to construct an Analytics Dataset (ADS) that includes many new features or variables derived from the raw data. The ADS is usually kept in row-column format.

In classic enterprises, much of the relevant data is nowadays housed in an analytics data warehouse, a database, or a data lake. The relevant data is extracted, transformed, and loaded (ETLed) to the data warehouse from a variety of operational data sources, and its validity at this level is assured. This process has greatly reduced the data preparation burden for data scientists. In earlier days, data miners had to be concerned about every aspect of the data, from collection, integration, and cleansing from operational data sources, all the way through the generation and validation of the ADS. Today, these tasks are routinely performed by IT data management/engineering teams at a department or enterprise level. This enables data scientists to start from a unified, integrated, and trusted data source. Thus, they can concentrate their efforts on the design and development of the ADS, which is often the most interesting part of the data preparation and use.

Structured data in a data warehouse is kept in normalized form using an appropriate data schema. "Early binding" of this data makes things simpler for analysis, as long as all relevant data (sources, records, columns), in its entirety, has been placed there. If new business problems arise, data miners can rely on having all the relevant data to choose from. In reality, however, due to the cost of the data warehouse, not all relevant data (even structured data) can be placed in the data warehouse. As a result, the predefined schema is only designed to support the current business problems and what is known today. This is not desirable when a business problem arises that requires data elements not present in the data warehouse. However, this model works fine for standard use cases of data mining and "business as usual" scenarios.

The ADS is typically a de-normalized dataset, in rows and columns, created from the relevant data: one row per entity to be modeled. Its records cover many examples, ideally from all possible situations and outcomes for the business problem at hand. Hence "the more data, the better" is the rule, meaning that more rows (entities) and more columns (attributes) are useful. ADS columns represent potentially influential variables that could have relevance for different business outcome(s). Creation of the ADS is a time-consuming process.

One of the ideals of data science has been the ability to automatically generate relevant features from raw data (transactional or non-transactional). However, this process is often manual when new problems are to be solved. A popular approach, called "kitchen sink", is to automatically (or manually) generate a large number of candidate features and then let the learning algorithm select the most significant ones using one or more variable selection techniques. There is, however, ultimately no replacement for the creativity, expertise, and knowledge of the data miner at this step. More recently, deep nets have been trained on huge amounts of images with the common goal that the hidden layers progressively learn useful low-level and high-level features applicable to a wide range of image recognition tasks.
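To make the kitchen-sink idea concrete, here is a minimal Python sketch, under the assumption of a small made-up ADS with a binary churn target: a pool of candidate features is generated mechanically and a variable selection step keeps only the most informative ones. All column names, sizes, and thresholds are illustrative, not part of the course material.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical ADS: a few raw numeric columns and a binary target.
rng = np.random.default_rng(0)
ads = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, 500),
    "monthly_spend": rng.normal(70, 20, 500),
    "support_calls": rng.poisson(2, 500),
})
ads["churned"] = (ads["support_calls"] > 3).astype(int)

X_raw = ads.drop(columns=["churned"])
y = ads["churned"]

# 1) "Kitchen sink": mechanically generate many candidate features
#    (here, all pairwise interactions and squares of the raw columns).
poly = PolynomialFeatures(degree=2, include_bias=False)
candidates = pd.DataFrame(
    poly.fit_transform(X_raw),
    columns=poly.get_feature_names_out(X_raw.columns),
)

# 2) Variable selection: keep only the most informative candidates.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
selector.fit(candidates, y)
print("Selected features:", list(candidates.columns[selector.get_support()]))
```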

The ADS is the input to the advanced data mining algorithms, and its creation process is where most of the effort for a complex data science exercise goes, assuming all relevant raw data is already available and validated. ADS preparation can consume 60%–90% of all effort during a data science project, and this is when the relevant data is already coming from a trusted source such as an EDW. This statistic often surprises people who are new to the field. Business domain knowledge, in addition to data and modeling expertise, is essential to the success of this phase. Technically, a good knowledge of data manipulation tools and techniques is essential in the creation of the ADS. Examples of popular operations include sampling, balancing, aggregation, sorting, merging, appending, filtering, profiling, selecting records or columns, and sub-setting.
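A short pandas sketch of a few of the operations listed above (aggregation, merging, filtering, sampling, and balancing), rolling made-up transaction records up into the one-row-per-entity ADS shape. The tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw transactions and outcome labels.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [20.0, 35.0, 5.0, 7.5, 12.0, 300.0],
    "channel": ["web", "store", "web", "web", "store", "web"],
})
labels = pd.DataFrame({"customer_id": [1, 2, 3], "churned": [0, 0, 1]})

# Aggregation: one row per customer, with derived variables.
ads = (tx.groupby("customer_id")
         .agg(n_tx=("amount", "size"),
              total_spend=("amount", "sum"),
              avg_spend=("amount", "mean"),
              web_tx=("channel", lambda c: (c == "web").sum()))
         .reset_index())

# Merging the outcome and filtering records of interest.
ads = ads.merge(labels, on="customer_id", how="left")
high_value = ads[ads["total_spend"] > 50]
print(len(high_value), "high-value customers")

# Simple balancing by down-sampling the majority class.
minority = ads[ads["churned"] == 1]
majority = ads[ads["churned"] == 0].sample(n=len(minority), random_state=0)
balanced = pd.concat([minority, majority]).sort_values("customer_id")
print(balanced)
```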

Variable creation is the final step in creating an ADS. In addition to technical skills, the creation of the ADS requires creativity, intuition, and experience; it is an art as well as a science. With a poorly designed ADS the effort is futile, no matter how sophisticated the chosen data mining algorithms are or how relevant the raw data was. Often the technical failures in data mining can be traced directly back to poor ADS design and creation. The assumptions used for ADS construction, the computational processes used to generate it, and the data validity checks that ascertain its health before modeling are all important contributors to the success or failure of the whole exercise.

The final variables (also called attributes or features) in the ADS are typically inspired by business domain knowledge. However, that is just the starting point, not the finish line. For example, to detect fraud, the ADS will first be built by deriving variables that are already known or expected to have associations and relationships with fraud.

Although this typically comes with domain expertise, data miners always think outside the box and will derive and include other variables that have not been suggested or thought of by domain experts. They rely on powerful variable selection techniques for the final choice of useful variables in the model(s). Data preparation is both compute- and data-intensive.

High-performance computing platforms such as MPP databases, dedicated multiprocessor analytics servers, or computer clusters with fast shared storage have been critical to the robust and timely creation of ADS.


5. Building a Stunning Portfolio

By the end of the program, students will be able to prove their skills with a public portfolio of Data Science, Machine Learning, and Deep Learning projects.

Many employers want to see your public GitHub page or Azure Notebooks.

All our projects follow standard guidelines and will be turned in through Azure Notebooks or public Git repositories, with the intention of making them look good to future employers.

5.1 Business Decision Framing (DecisionsFirst)

Using DecisionsFirst to frame the analytics requirements, we define the data that needs to be provided, identify the analytic technology to be used, and define the workflow for delivering this data to decision-makers.

A more effective way to define requirements for analytic projects, and to frame those requirements accurately, is to model the decision-making to be improved.

DecisionsFirst is used to frame the decisions for the projects used in the program.


This DecisionsFirst requirements approach provides critical information for successful analytic projects, complements workflow requirements, and correctly identifies the data that will be required for the effort.

Page 6


Data science (covering predictive analytics, data mining, machine learning, and related practices) is a multidisciplinary field that requires knowledge of a number of different skills, practices, and technologies, including (but not limited to) machine learning, pattern recognition, applied mathematics, programming, algorithms, statistics, and databases. In the context of big data, more skills and knowledge are required, such as knowledge of distributed computing techniques, algorithms, and architectures. By nature, data science is a creative process that combines science, engineering, and art. Hence, its success has depended heavily on the quality and experience of the team carrying it out. Thus, for some time in the past, data mining projects were not repeatable with the same level of success across different enterprises. However, with the maturity of the practice, that has changed.

6. Course Curriculum

Week 1: Data Science Fundamentals

Learn Data Science basics, gain a good understanding of Data Science and Data Analytics, get an overview of the EMC and CAP certificates, and be introduced to concepts, methodologies, and best practices.

Session 1- Introduction to Data Science

Session 2- Learning Path, CAP Certificate- Crash Course in R

Session 3- Tableau: Introduction & Basics: Bar chart- RapidMiner: Introduction to RapidMiner- Capstone Project – Project Selection (open datasets & problems)

[Figure: Data science phases at a high level. Phases: Business Understanding & Use, Data Understanding & Use, Analysis & Assessment, and Implementation, which are respectively business, data, analytics, and operations focused.]



Week 2: Business Understanding & Problem Framing

Determine Business Objectives, Assess Situation, Determine Data Mining Goals, and Produce Project Plan. Gain a good understanding of problem framing, decision framing, decision analysis, and decision implementation using DecisionsFirst Modeler.

Session 1- Data Understanding & Data Preparation

Session 2- Exploratory Data Analysis in R- Exploratory Data Analysis in Python

Session 3- Tableau: Time series, Aggregation, and Filters- RapidMiner: Data Preparation & Correlation Analysis- Capstone Project – Analytics Approach 1 (Data preparation & classification)

5.2 Data Science Process (CRISP-DM)

The program follows the Cross-Industry Standard Process for Data Mining (CRISP-DM). CRISP-DM puts business understanding front and center at the beginning of the project.

CRISP-DM remains the top methodology for data mining projects, with essentially the same share as in 2007 (43% vs 42%).

5.3 Fundamentals of Python and R

Data scientists must know how to code. Start by learning the fundamentals of two popular programming languages, Python and R.

Basics of Python and R
Conditionals and loops
String and list objects
Functions & OOP concepts
Exception handling
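A minimal Python sketch touching the topics listed above: conditionals, loops, string and list objects, a function, a small class, and exception handling. The names and values are purely illustrative.

```python
class Course:
    """A tiny class illustrating the OOP concepts covered in this module."""

    def __init__(self, name, weeks):
        self.name = name
        self.weeks = weeks

    def describe(self):
        return f"{self.name} runs for {self.weeks} weeks"


def average(scores):
    """Return the mean of a list of numbers, guarding against an empty list."""
    if not scores:                       # conditional
        raise ValueError("no scores given")
    total = 0
    for s in scores:                     # loop
        total += s
    return total / len(scores)


course = Course("Data Science End to End", 13)
print(course.describe())

words = course.describe().split()        # string and list objects
print(len(words), "words in the description")

try:                                     # exception handling
    print(average([]))
except ValueError as err:
    print("Handled:", err)
```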

Page 7


Week 3: Data Understanding & Data Preparation

Exploratory Data Analysis using R & Python: descriptive statistics, hypothesis testing, data preprocessing, missing value imputation, and data transformation. Dive deep into the R programming language, from basic syntax to advanced packages and data visualization (e.g. reshape2, dplyr, string manipulation, ggplot2, R Shiny).
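As a rough Python illustration of this week's themes (the course covers the same ideas in R as well), the sketch below computes descriptive statistics, imputes a missing value, applies a simple transformation, and runs a two-sample t-test as an example of hypothesis testing. The dataset and column names are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical dataset with one missing value.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "A"],
    "value": [10.0, 12.0, 15.0, np.nan, 14.0, 11.0],
})

print(df.describe())                                    # descriptive statistics
print(df.isna().sum())                                  # missing-value profile

df["value"] = df["value"].fillna(df["value"].median())  # simple imputation
df["log_value"] = np.log1p(df["value"])                 # data transformation

# Hypothesis testing: two-sample t-test between the groups.
a = df.loc[df["group"] == "A", "value"]
b = df.loc[df["group"] == "B", "value"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```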

Session 1- Data Understanding & Data Preparation

Session 2- Exploratory Data Analysis in R- Exploratory Data Analysis in Python

Session 3- Tableau: Time series, Aggregation, and Filters- RapidMiner: Data Preparation & Correlation Analysis- Capstone Project – Analytics Approach 1 (Data preparation & classification)


5.4 Machine Learning in R/Python and RapidMiner

Machines have increased the ability to interpret large volumes of complex data. Combine aspects of computer science with statistics to formulate algorithms that help machines draw insights from structured and unstructured data.

Building models using the algorithms below:
Linear and logistic regression
Decision trees
Support vector machines (SVMs) and kernel SVM
Random forests
XGBoost
K-nearest neighbors & hierarchical clustering
Principal component analysis
Text analytics and time series forecasting
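A brief scikit-learn sketch of the kind of model building listed above, fitting two of the named families (logistic regression and a random forest) on a public toy dataset. The dataset and settings are placeholders, not the actual course projects.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Public toy dataset standing in for a course dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

models = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")
```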



Week 4: Classification Analysis - Part 1

Time to build your first models. You'll start machine learning with supervised models, covering everything from the classics to cutting-edge techniques.

Get to know R packages and scikit-learn, the most widely used modeling and machine learning package in Python.

Page 8



5.5 Data Visualization: Matplotlib, Seaborn, ggplot & Tableau

Complex data sets call for simple representations that are easy to follow. Visualize and communicate key insights derived from data effectively by using tools like Matplotlib and Tableau.

Interactive visualizations with Matplotlib
Data visualizations using Tableau
Tableau dashboard and storyboard
Tableau and R integration
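A small Matplotlib and Seaborn sketch of the kind of chart building covered in this module (Tableau is a separate GUI tool and is not shown in code). The data is made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical monthly sales data.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "month": pd.date_range("2018-01-01", periods=12, freq="MS"),
    "sales": np.cumsum(rng.normal(10, 3, 12)) + 100,
    "region": ["East", "West"] * 6,
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(df["month"], df["sales"], marker="o")       # Matplotlib line chart
ax1.set_title("Monthly sales")
ax1.set_xlabel("Month")
ax1.set_ylabel("Sales")

sns.boxplot(data=df, x="region", y="sales", ax=ax2)  # Seaborn distribution view
ax2.set_title("Sales by region")

fig.tight_layout()
plt.show()
```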

5.7 Capstone Projects

The course is project-based. There are 9 projects in total, and students will be working on ONE team project starting from day 1. By the end, you'll present a capstone after going off to gather real data and build your own models using machine learning techniques.

Customer Churn Project
Classification Problem Projects
Regression Problem Projects
Clustering Problem Projects
Time Series Analysis Projects
Fraud & Anomaly Detection Projects

Here we'll focus on situations where we have a knowable and observable outcome. You'll start with some of the classical models of machine learning, like decision trees and OLS, and learn techniques for both regression and classification. You'll then move to more and more sophisticated models, such as random forests and boosting.

Session 1- Classification Analysis

Session 2- Decision Tree & Random Forest in R

- Decision Tree & Random Forest in Python

Session 3- Tableau Basics: Maps, Scatter-plots,

- RapidMiner: Decision Tree

- Capstone Project – Analytic Approach II (Machine learning techniques)

Week 5: Classification Analysis - Part 2

Session 1

- Logistic Regression, KNN, Naïve Bayes in R

- Logistic Regression, KNN, Naïve Bayes in Python

Session 2- SVM, Kernel SVM in R & Python
Session 3- Tableau: Joining and Blending Data, plus Dual Axis Charts
- RapidMiner: Logistic Regression, KNN, Naïve Bayes
- Capstone Project – Project Analysis Techniques (Presentation techniques)


5.6 Big Data & Spark

Lastly, manage your infrastructure with a data engineering platform like Spark so that your efforts can be focused on solving data problems rather than problems of machines.

Introduction to Big Data & Spark
RDDs in Spark, data frames & Spark SQL
Spark streaming, MLlib & GraphX
Linear algebra
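A minimal PySpark sketch of the DataFrame and Spark SQL ideas listed above, assuming a local Spark installation. The data and names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bbds-demo").getOrCreate()

# Tiny made-up transaction data as a Spark DataFrame.
rows = [("1", "web", 20.0), ("1", "store", 35.0), ("2", "web", 5.0)]
tx = spark.createDataFrame(rows, ["customer_id", "channel", "amount"])

# DataFrame API: aggregate spend per customer.
tx.groupBy("customer_id").agg(F.sum("amount").alias("total_spend")).show()

# Spark SQL on the same data.
tx.createOrReplaceTempView("transactions")
spark.sql("SELECT channel, AVG(amount) AS avg_amount "
          "FROM transactions GROUP BY channel").show()

spark.stop()
```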

Page 9


Machine learning is used for both structured and unstructured data such as text, speech, audio, and images. Aside from learning, there are other types of algorithms used in data mining. For example, iterative graph algorithms are used for link analysis, community detection, and PageRank. A variety of feature extraction algorithms are used for creating new variables (attributes) as a combination of the original variables and/or records (transactions) across time. A data mining model (or learner) is an efficient and compact representation of a very large data collection with specific goals (represented by the ADS), expressed using mathematical concepts and language. A good data mining model is one that learns effectively from the data, using a chosen learning algorithm, and then generalizes what it has learned to other different but similar observations.
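As a small illustration of an iterative graph algorithm, the sketch below runs PageRank on a toy link graph using the networkx library; the graph itself is invented for the example.

```python
# Toy PageRank example with networkx (the link graph is made up for illustration).
import networkx as nx

# Directed graph: an edge A -> B means page A links to page B
g = nx.DiGraph()
g.add_edges_from([
    ("A", "B"), ("A", "C"),
    ("B", "C"),
    ("C", "A"),
    ("D", "C"),
])

# Iteratively computes a stationary importance score for each page
scores = nx.pagerank(g, alpha=0.85)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```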

In the A-phase, many learners are built and compared with each other based on a set of selected business metrics. A champion model (which could be a set of models), the one that provides the highest business value on those metrics, is selected for final deployment. The model development and selection process is iterative and can effectively be automated as long as the ADS is fixed. Today, thanks to significant advances in data science tools and expertise, modeling or learning is the most straightforward of all the data science phases.
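A minimal sketch of this compare-and-select loop using scikit-learn cross-validation on synthetic data. AUC stands in here for the business metric, which in practice would be something like expected profit; the candidate models and hyperparameters are arbitrary illustration choices.

```python
# Compare several candidate learners with cross-validation and keep a "champion".
# Synthetic data and AUC as the metric are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
          for name, model in candidates.items()}

champion = max(scores, key=scores.get)
print(scores, "-> champion:", champion)
```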

Learners can be categorized into two major types: fixed and flexible (non-parametric models fall into the latter). Fixed models are those whose mathematical representation has a fixed structure and size, given a fixed number of inputs. Flexible models are those whose internal mathematical structure can vary with the data, while the number of inputs stays the same. Variable transformation is essential for algorithms such as the linear regression family, neural networks, and many others that ingest only numeric variables as inputs. Such transformations are also of great importance for linear algorithms (such as linear and logistic regression) to capture nonlinearities inherent in the data. Some nonparametric flexible learners such as decision trees and random forests can directly ingest nominal variables and are inherently less sensitive to such transformations. In almost all cases, these transformations help in understanding the behavior of the model and interpreting its results. Single-variable decision trees are sometimes used for the grouping or binning of variables.
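A small sketch of two such transformations, one-hot encoding a nominal variable and binning a numeric one, using pandas; the data, column names, and bin edges are invented for the example.

```python
# One-hot encoding and binning: typical transformations before linear models.
# The data, column names, and bin edges are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "north"],
    "income": [42000, 87000, 58000, 23000],
})

# Nominal -> numeric dummy columns (needed by linear/logistic regression, neural nets)
encoded = pd.get_dummies(df, columns=["region"], prefix="region")

# Binning a numeric variable can capture simple nonlinear effects in linear models
encoded["income_band"] = pd.cut(df["income"],
                                bins=[0, 30000, 60000, 100000],
                                labels=["low", "mid", "high"])
print(encoded)
```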

In very special situations, one may develop an ensemble model: a group of different models that all participate in making a decision. Bagging, boosting, and fusion (stacking) are some of the techniques used to develop such models. In bagging, one generates random variations of the training data by resampling, develops a learner for each sample, and then combines the results by voting or averaging. Empirical results have shown that this process works well, especially for weak learners such as decision trees. In boosting, one assigns a variable weight to each training example such that each new learner focuses more on the examples the previous learners failed to predict accurately. In fusion, or stacking, the predictions (outputs) of different learners are fed into a "super learner" that combines them optimally.

The Netflix Prize competition was an extreme and unrealistic example of the ensemble approach, in which the participating teams discovered that they could improve their predictions by combining their models with those of other teams. The top two winners had built ensembles of over 100 learners, and fusing the two winning ensembles improved prediction performance further still. However, most of the teams were academically oriented and had a great deal of time (three years) and ample resources at their disposal. That is a luxury that does not exist in real-world business scenarios, where decisions have to be made much faster or opportunities are lost. A popular and practical ensemble model is the random forest, which uses a combination of bagging and random feature selection to combine hundreds or thousands of small, weak learners (decision trees).
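A compact sketch of the three ensemble styles plus a random forest, using scikit-learn on synthetic data; the dataset and hyperparameters are arbitrary choices for illustration.

```python
# Bagging, boosting, stacking, and a random forest in scikit-learn.
# Synthetic data; hyperparameters are arbitrary illustration values.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              StackingClassifier, RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    # Bagging: resample the training data, average many weak trees
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
    # Boosting: each new learner focuses on previously mis-predicted examples
    "boosting": AdaBoostClassifier(n_estimators=100),
    # Stacking: a "super learner" combines the base learners' predictions
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000)),
    # Random forest: bagging plus random feature selection at each split
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}

for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```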

Week 6: Regression Analysis - Part 1

Session 1- Data Mining & Machine Learning (Regression Analysis)

Session 2- Decision Tree & Random Forest in R

- Decision Tree & Random Forest in Python

Session 3- Tableau: Table Calculations, Advanced Dashboards, Storytelling

- Capstone Project – Project Analysis Techniques (Data visualization techniques)


Week 7: Regression Analysis - Part 2

Session 1

- Simple Linear, Multiple Linear & Polynomial Regression in R

- Simple Linear, Multiple Linear & Polynomial Regression in Python

Session 2- Support Vector Machine in R & Python

Session 3- Tableau: Table Calculations, Advanced Dashboards, Storytelling

- RapidMiner: Linear Regression

- Capstone Project – Data Analysis Execution Plan (Data visualization tools)


5.8 Analytical Tools used

Nine analytics tools are used throughout the program. Students will learn how to use them all to transform data into information and products.

R & Python for Data Science, Tableau, BigML, RapidMiner, SAS Enterprise Miner, IBM Watson Analytics, IBM Bluemix


Week 8: Clustering, Association Rules, Dimensionality Reduction

The second largest branch of machine learning: here we'll cover various forms of unsupervised models, useful for things like feature reduction and clustering. Unsupervised learning doesn't focus on an outcome, but rather finds associations and relationships present in the data. You'll start with classic clustering algorithms like k-means to break almost any data set into groups, and then move on to techniques such as anomaly detection and time series analysis.
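A minimal k-means sketch with scikit-learn; the synthetic two-dimensional data and the choice of three clusters are arbitrary for the example.

```python
# k-means clustering on synthetic data; the number of clusters (3) is arbitrary here.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # coordinates of the 3 cluster centers
print(labels[:10])               # cluster assignment of the first 10 points
```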

Session 1- Clustering Analysis

- k-means, Hierarchical Clustering in R

- k-means, Hierarchical Clustering in Python

Session 2- Association Rules

- Apriori, Eclat in R

- Apriori, Eclat in Python

Session 3- Dimensionality Reduction: PCA, LDA & Kernel PCA in R and Python (a brief PCA sketch follows this list)

- RapidMiner: Data Preparation & Correlation Analysis

- Capstone Project – Data Analysis Review (Interpretation)
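As referenced above, a minimal PCA sketch with scikit-learn; the synthetic data and the choice of two components are assumptions for illustration.

```python
# Dimensionality reduction with PCA; reducing to 2 components is arbitrary here.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=500, n_features=15, random_state=0)

# Standardize first so that each feature contributes on a comparable scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (500, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```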


Organizations That Analyze All Relevant Data and Deliver Actionable Information Will Achieve an Extra $430 Billion in Productivity Gains Over Their Less Analytically Oriented Peers by 2020
Source: Cloudera

Week 9: Anomaly Detection, Time Series Analysis, Text Mining & NLP

Session 1

- Text Analysis & NLP

Session 2- Anomaly Detection, Time Series Analysis (a short anomaly detection sketch follows this list)

Session 3- Capstone Project – Project Selection (Open datasets & problems)
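As referenced under Session 2, a minimal anomaly detection sketch using scikit-learn's IsolationForest on synthetic data; the data and the contamination rate are assumed values.

```python
# Anomaly detection with an Isolation Forest; data and contamination rate are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # bulk of the data
outliers = rng.uniform(low=-6, high=6, size=(10, 2))      # a few extreme points
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(X)        # +1 = normal, -1 = flagged as anomaly

print("points flagged as anomalies:", int((labels == -1).sum()))
```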

Week 10: Deep Learning & Reinforcement Learning

Deepen machine learning skills with R and scikit-learn. Deep learning with Theano, TensorFlow & Keras; neural networks; convolutional neural networks.
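A minimal neural network sketch with Keras (TensorFlow backend); the architecture, synthetic data, and training settings are arbitrary illustration choices, not the course's reference solution.

```python
# A tiny feed-forward neural network in Keras; data and architecture are illustrative only.
import numpy as np
from tensorflow import keras

# Synthetic binary classification data
X = np.random.rand(500, 20)
y = (X.sum(axis=1) > 10).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # probability of the positive class
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy] on the training data
```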

Session 1- Deep Learning and Neural Nets

Session 2- ANN in R & Python

- CNN in Python

Session 3- Reinforcement Learning: Random Selection, UCB, Thompson Sampling in R (a short Thompson Sampling sketch follows this list)

- Reinforcement Learning: Random Selection, UCB, Thompson Sampling in Python

- Tableau: Advanced Data Preparation

- Capstone Project – Data Analysis Execution Plan (Data visualization tools)
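As referenced above, a minimal Thompson Sampling sketch for a Bernoulli multi-armed bandit in plain numpy; the three arms' true success rates are invented for the example.

```python
# Thompson Sampling for a 3-armed Bernoulli bandit; true success rates are made up.
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.05, 0.10, 0.20]          # unknown to the algorithm
successes = np.ones(3)                    # Beta(1, 1) prior for each arm
failures = np.ones(3)

for _ in range(2000):
    # Sample a plausible success rate for each arm from its posterior
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))         # play the arm that looks best right now
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("plays per arm:", (successes + failures - 2).astype(int))
print("estimated rates:", np.round(successes / (successes + failures), 3))
```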


Week 11: Model Assessment, Validation, Optimization & Tuning

Introduction to cost functions, objective functions, model optimization, model tuning, regularization, gradient boosting, and grid and random search. Analyze the performance of each algorithm.
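A brief sketch combining regularization, k-fold cross-validation, and grid search with scikit-learn; the parameter grid and synthetic data are illustrative assumptions.

```python
# Tuning a regularized (elastic net) regression with k-fold CV and grid search.
# The parameter grid and synthetic data are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=30, noise=10.0, random_state=0)

param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],   # overall regularization strength
    "l1_ratio": [0.2, 0.5, 0.8],       # mix between L1 (lasso) and L2 (ridge)
}

search = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV MSE:", -search.best_score_)
```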

Session 1- Model Assessment – Confusion Matrix (CM), ROC, Rank-Ordered Approach, R², MSE, MAE, Median Error, Median Absolute Error, Correlation

Session 2- k-Fold Cross Validation & Grid Search in R

- k-Fold Cross Validation & Grid Search in Python

- XGBoost, AdaBoost in R & Python

Session 3- Lasso Regression in R & Python

- Ridge Regression in R & Python

- Elastic Net Regression in R & Python

- RapidMiner: Cross-Validation

- Capstone Project – Presentation (Research and trends in data analytics)

Week 12: Predictive Analytics, Cognitive Computing & Big Data

Introduction to Predictive Analytics, introduction to SAS Enterprise Miner, decision tree classification with SAS, decision tree regression with SAS, and neural networks with SAS.

Learn the concepts of high-performance and parallel computing, cognitive computing, and Watson Analytics. Introduction to MapReduce, Hadoop, Hive, Spark, and Spark MLlib.
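A minimal Spark MLlib sketch, assuming a local PySpark setup; the tiny in-line dataset with two numeric features and a binary label is invented for illustration.

```python
# A tiny Spark MLlib classification pipeline; the in-line dataset is invented.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

data = spark.createDataFrame(
    [(0.5, 1.2, 0), (1.5, 0.3, 0), (3.2, 4.1, 1), (2.8, 3.9, 1)],
    ["f1", "f2", "label"],
)

# Assemble feature columns into the single vector column MLlib expects
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(data)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()
```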

Week 13: Resume and Interview Preparation

Students will have access to more than 250 real interview questions, followed by a session covering the answers and resume preparation.

Session 1- Interview Preparation

Session 2- Resume Preparation

Session 3- Job Placement and Placement Guidance

Class Price

The class fee is $2499

You can repeat the class multiple times at no additional cost.

Class Interest Group Registration

http://www.bigbang-datascience.com/register-for-a-course

[email protected]


ABOUT BBDS

BBDS trains individuals and corporate teams in Data Science and Data Analytics to help you uncover actionable insights that drive competitive advantage and capture business value. We train on integrating and operationalizing data analytics solutions, giving you visibility into previously opaque or hard-to-measure processes and empowering you to make smarter business decisions.

Our team of data experts, consultants, and data scientists leverages proven analytics methodologies, tools, and best practices to define the right analytics solutions for you: solutions that solve complex business challenges and specific use cases and drive future growth.

Mo Medwani Data Scientist

• 12+ Years of experience in IT (Service Delivery Management)

• BBDS Founder

• 7+ Years of experience in Data Analytics

• 3+ Years of experience in Data Science and related technologies

• 3 Master's degrees (MBA, MS-IT, MS Data Science)

• Ph.D. Candidate (Data Science and Data Analytics)

• Training Individuals and Corporations in Data Science since Dec 2016

Big Bang Data Science Solutions

[email protected]

