Date posted: 27-Jan-2015
Tools and Methods for Big Data Analytics
One hour of everything you need to know to navigate the data science jungle
by Dahl Winters, RTI International
Overview
• What is Big Data Analytics
• What Tools to Use When
• Most Common Hadoop Use Cases
• Geospatial Analytics
  o NoSQL and Graph Databases
  o Machine Learning
    • Classification
    • Clustering
  o Deep Learning

Resources: http://www.scoop.it/u/dahl-winters
[Diagram: the analytics landscape – descriptive, predictive, and prescriptive analytics spanning big data, geospatial, social media, text, network, and sentiment analytics and image processing. Data science sits at the intersection of statistics, machine learning, software engineering, and lots of complex data.]
What is Hadoop Good For?
• Essentially, anything involving complex data and/or multiple data sources that requires batch processing, parallel execution, spreading data over a cluster of servers, or taking the computation to the data because the data are too big to move
• Text mining, index building, graph creation/analysis, pattern recognition, collaborative filtering, prediction models, sentiment analysis, risk assessment
• If your data are small, Hadoop will be slow – use something else (scikit-learn, R, etc.)
What is Hadoop?
When to Use What
• Depends on whether you need real-time analysis or not
  o Affects what products, tools, hardware, data sources, and data frequency you will need to handle
• Data frequency and size
  o Determine the storage mechanism, storage format, and necessary preprocessing tools
  o Examples: on-demand (social media data), continuous real-time feed (weather data, transaction data), time-based data (time series)
• Type of data
  o Structured (RDBMS)
  o Unstructured (audio, video, images)
  o Semi-structured
Decision Tree
• How big is your data? Less than 10 GB / 10 GB < x < 200 GB / more than 200 GB
• What size queries? A single element at a time / one pass over all the data / multiple passes over big chunks
• Response time? Less than 100 s / don't care, just do it
• The answers point toward one of: small data methods; big storage with streaming; batch processing; or interactive query and graph engines (Impala, Drill, Titan)
Big Data Considerations
http://www.ibm.com/developerworks/library/bd-archpatterns1/
Survey of Use Cases
• 9 general use cases for big data tools and methods
• 2 real-time analytics tools
• 8 MapReduce use cases – what you can use Hadoop for
• 1 geospatial use case
Use Cases
1. Utilities want to predict power consumption
   o Use machine-generated data
   o Smart meters generate huge volumes of data to analyze, and the power grid contains numerous sensors monitoring voltage, current, frequency, etc.
2. Banks and insurance companies want to understand risk
   o Use machine-generated, human-generated, and transaction data from credit card records, call recordings, chat sessions, emails, and banking activity
   o Want to build a comprehensive data picture using sentiment analysis, graph creation, and pattern recognition
3. Fraud detection
   o Machine-generated, human-generated, and transaction data
   o Requires real-time or near-real-time transaction analysis and the generation of recommendations for immediate action
Use Cases
4. Marketing departments want to understand customers
   o Use web and social data such as Twitter feeds
   o Conduct sentiment analysis to learn what users are saying about the company and its products/services; sentiment must be integrated with customer profile data to derive meaningful results
   o Customer feedback may vary according to demographics, which are geographically uneven and thus have a geospatial component
5. They also want to understand customer churn
   o Use web and social data, along with transaction data
   o Build behavioral models that include social media and transaction data to predict and manage churn by analyzing customer activity; graph creation/traversal and pattern recognition may be involved
6. They may also just want to get insights from the data
   o Use Hadoop to try out different analyses on the data to find potential new patterns/relationships that yield additional value
Use Cases
7. Recommendations
   o If you bought this item, what other items might you buy?
   o Collaborative filtering = using information from users to predict what similar users might like
   o Requires batch processing across large, distributed datasets
8. Location-Based Ad Targeting
   o Uses web and social data, perhaps also biometrics for facial recognition; also machine-generated data (GPS) and transaction data
   o Predictive behavioral targeting and personalized messaging – companies can use facial recognition technology in combination with a photo from social media to make personalized offers based on buying behavior and location
   o Serious privacy concerns
9. Threat Analysis
   o Pattern recognition to identify anomalies
Real-Time Analytics
• Streaming data management is currently the only technology that delivers low-latency analytics at large scale
• Scale by adding more servers
• Twitter Storm – can be used with any programming language; suited for online machine learning and continuous computation; can process more than a million tuples per second per node
• LinkedIn Samza – built on top of LinkedIn's Kafka messaging system
MapReduce Use Cases
1. Counting and Summing
   o Given N documents, each with a set of terms, calculate the total number of occurrences of each term across all N documents
2. Collating
   o A set of items each have a property, and we want to save all items with that property into one file, or perform some computation that requires all property-containing items to be processed as a group (e.g., building inverted indices)
3. Filtering, Parsing, and Data Validation
   o We want to collect all records that meet some condition, or transform each record into another representation (e.g., text parsing, value extraction, conversion from one format to another)
4. Distributed Task Execution
   o Any large computational problem that can be divided into multiple parts, where the results from all parts can be combined into a final result
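The counting-and-summing pattern above is the classic word-count job. A minimal pure-Python sketch of the map, shuffle, and reduce phases, simulating on one machine what Hadoop spreads across a cluster:

```python
from itertools import groupby
from operator import itemgetter

def mapper(doc):
    # Map phase: emit (term, 1) for every term in the document.
    for term in doc.lower().split():
        yield term, 1

def reducer(term, counts):
    # Reduce phase: sum all counts emitted for one term.
    return term, sum(counts)

def map_reduce(docs):
    # Shuffle phase: sort mapper output so identical keys are adjacent,
    # then hand each key group to the reducer.
    mapped = sorted(kv for doc in docs for kv in mapper(doc))
    return dict(reducer(key, (v for _, v in group))
                for key, group in groupby(mapped, key=itemgetter(0)))

docs = ["big data big tools", "data tools"]
print(map_reduce(docs))  # {'big': 2, 'data': 2, 'tools': 2}
```

In a real Hadoop job, `mapper` and `reducer` would run as separate tasks (e.g., via Hadoop Streaming or one of the Python frameworks listed at the end of this deck), with the framework performing the sort-and-group step across the cluster.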
MapReduce Use Cases
5. Sorting
   o We want to sort records by some rule or process the records in a certain order
6. Iterative Message Passing (Graph Processing)
   o Given a network of entities and relationships between them, calculate each entity's state based on the properties of surrounding entities
7. Distinct Values (Unique Items Counting)
   o A set of records contains fields A and B, and we want to count the total number of unique values of field A, grouped by B
8. Cross-Correlation
   o Given a list of items bought by customers, for each pair of items calculate the frequency with which customers bought those items together
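One common way to compute the cross-correlation above is the "pairs" approach: the map step emits each unordered item pair that co-occurs in a basket, and the reduce step sums the counts per pair. A single-machine sketch (toy basket data, illustrative only):

```python
from collections import Counter
from itertools import combinations

def pair_frequencies(baskets):
    # "Pairs" approach: for every basket, emit each unordered item pair
    # once (the map step), then sum counts across baskets (the reduce step).
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [["bread", "milk", "eggs"], ["bread", "milk"], ["milk", "eggs"]]
freq = pair_frequencies(baskets)
print(freq[("bread", "milk")])  # 2
print(freq[("eggs", "milk")])   # 2
```

At scale the pair space can be huge, which is exactly why this job is a Hadoop use case: the pair counts shard naturally across reducers.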
Geospatial Analytics
• Question: What defines a community?
• Tools and Methods
  o Graph databases
  o Classification algorithms to identify characteristics of community members
  o Clustering algorithms to identify community boundaries
• Base Dataset
  o Synthetic Population Household Viewer
  o https://www.epimodels.org/midas/synthpopviewer_index.do
Graph Databases
• Think of nodes as points and edges as lines connecting the points
• Nodes can have attributes (properties); edges can have labels
• In the Hadoop ecosystem: Giraph, Titan, Faunus
• Giraph: in-memory, lots of Java code
• Titan: a database allowing fast querying of large, distributed graphs; choice of 3 storage backends
• Faunus: a graph analytics engine performing batch processing of large graphs; fastest with breadth-first searches
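The node/edge model above can be illustrated with a tiny in-memory property graph. This is a generic sketch in plain Python, not the Titan or Giraph API (those systems expose query languages such as Gremlin); the names and data are invented for illustration:

```python
# Nodes carry attributes (properties); edges carry labels.
nodes = {
    "alice": {"age": 34},
    "bob":   {"age": 29},
    "carol": {"age": 41},
}
edges = [
    ("alice", "knows",      "bob"),
    ("bob",   "knows",      "carol"),
    ("alice", "works_with", "carol"),
]

def neighbors(node, label):
    # Traverse only outgoing edges carrying the given label.
    return [dst for src, lbl, dst in edges if src == node and lbl == label]

print(neighbors("alice", "knows"))  # ['bob']
print(nodes["carol"]["age"])        # 41
```

A graph database does essentially this, but with indexed storage, distribution across machines, and a traversal language in place of the list comprehension.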
Identify This
Synthetic Population Household Viewer
http://portaldev.rti.org/10_Midas_Docs/SynthPop/portal.html
Machine Learning Algorithm Roadmap
http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
Classification Algorithms
• kNN, Naïve Bayes, logistic regression, decision trees, random forests, support vector machines, neural networks – oh my! How to decide?
• Look at the size of your training set
  o Small: high-bias/low-variance classifiers like Naïve Bayes are better, since the others will overfit; but high-bias classifiers may not be powerful enough to provide accurate models
  o Large: low-bias/high-variance classifiers such as kNN or logistic regression are better because they have lower asymptotic error
• When to use kNN
  o Personalization tasks – might employ kNN to find similar customers and base an offer on their purchase behaviors
  o Have to decide what k to use – vary k, calculate the accuracy against a holdout set, and plot the results
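The vary-k-against-a-holdout-set procedure can be sketched in plain Python. The toy 2-D data and Euclidean distance are assumptions for illustration; in practice you would use scikit-learn's `KNeighborsClassifier` and a proper cross-validation split:

```python
import math
from collections import Counter

def knn_predict(train, point, k):
    # Majority vote among the k nearest training points (Euclidean distance).
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def holdout_accuracy(train, holdout, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)

# Toy data: class "a" near the origin, class "b" near (5, 5).
train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
holdout = [((0.5, 0.5), "a"), ((5.5, 5.5), "b")]

# Vary k and record the accuracy; on real data you would plot this curve
# and pick the k at the knee.
for k in (1, 3, 5):
    print(k, holdout_accuracy(train, holdout, k))
```

On this well-separated toy set every k scores 1.0; on real data the curve typically dips at very small k (noise) and at very large k (over-smoothing).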
Classification Algorithms
• When to use Naïve Bayes
  o When you don't have much training data; Naïve Bayes converges more quickly than discriminative models like logistic regression
  o Any time – it is a good first thing to try, especially if your features are independent (no correlation between them)
• When to use logistic regression
  o When you don't have to worry much about features being correlated
  o When you want a nice probabilistic interpretation, which you won't get with decision trees or SVMs, in order to adjust classification thresholds or get confidence intervals
  o When you want to easily update the model to take in new data (using gradient descent), again unlike decision trees or SVMs
• When to use decision trees
  o They are easy to interpret and explain, but easy to overfit; to solve that problem, use random forests instead
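A minimal multinomial Naïve Bayes with add-one smoothing shows why it needs so little training data: it only estimates per-class priors and per-class word frequencies. The toy sentiment data below is invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (list-of-words, label) pairs.
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        # Log prior plus add-one-smoothed log likelihood of each word.
        lp = math.log(n_docs / total_docs)
        for w in words:
            lp += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("good great fun".split(), "pos"), ("great happy".split(), "pos"),
        ("bad awful boring".split(), "neg"), ("bad sad".split(), "neg")]
model = train_nb(docs)
print(predict_nb(model, ["great", "fun"]))  # pos
```

Note the "naïve" assumption in the inner loop: word likelihoods are simply multiplied (added in log space), i.e., features are treated as conditionally independent given the class.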
Classification Algorithms
• When to use random forests
  o Whenever you would consider decision trees: random forests almost always have lower classification error and better F-scores, and almost always perform as well as or better than SVMs while being far easier to understand
  o If your data are very uneven, with many missing variables
  o If you want to know which features in the data set are important
  o If you want something that trains fast and scales well
  o Logistic regression vs. random forests: both are fast and scalable; the latter tends to beat the former in terms of accuracy
• When to use SVMs
  o When working with text classification or any situation where high-dimensional spaces are common
  o Advantage: high accuracy, generally superior at classifying complex patterns; disadvantage: memory-intensive and unsuitable for large training sets
Classification Algorithms
• When to use neural networks
  o Slow to converge and hard to set parameters, but good at capturing fairly complex patterns; slow to train but fast to use – unlike SVMs, the execution speed is independent of the size of the data the network was trained on
  o MLP neural networks are well suited to complex real-world problems and, on average, superior to both SVMs and Naïve Bayes; however, the model they build for classifying cannot easily be understood
• General points
  o Better data often beats better algorithms – designing good features goes a long way
  o With a huge dataset, the choice of classification algorithm might not really affect performance much, so choose based on speed or ease of use instead
  o If accuracy is paramount, try many different classifiers and select the best one by cross-validation, or use an ensemble method to combine them all
Clustering Algorithms
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
Clustering Algorithms
• Canopy clustering
  o A pre-clustering algorithm, often run before k-means or hierarchical clustering to speed up clustering on large data sets and potentially improve the results
• DBSCAN/OPTICS
  o DBSCAN – density-based spatial clustering of applications with noise; finds density-based clusters in spatial data
  o OPTICS – ordering points to identify the clustering structure; a generalization of DBSCAN to multiple ranges, so meaningful clusters can be found in areas of varying density
• Hierarchical clustering
• K-means clustering
  o The most common
• Spectral clustering
  o Dimensionality reduction before clustering in fewer dimensions
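Since k-means is the most common of these, here is a minimal sketch of Lloyd's algorithm in plain Python (toy 2-D points invented for illustration; at scale you would use Mahout on Hadoop or scikit-learn's `KMeans`):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm: assign each point to its nearest centroid,
    # then recompute each centroid as the mean of its assigned points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Note that k must be chosen up front, which is exactly the branch point in the decision tree on the next slide.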
Clustering Decision Tree
• Do you want to define the number of clusters beforehand?
  o Yes – how many clusters would you have? A few: spectral clustering; a medium number: k-means; a large number: hierarchical clustering
  o No – do your points have varying densities? No: DBSCAN; yes: OPTICS
Deep Learning
• Why?
  o Computers can learn without being explicitly taught
  o Can adapt to experience rather than depending on a human programmer
  o Think of a baby that learns sounds, then words, then sentences – learning must start from low-level features and graduate to higher-level representations
• What?
  o Essentially, layers of neural networks
  o Restricted Boltzmann machines, deep belief networks, auto-encoders
  o http://www.meetup.com/Chicago-Machine-Learning-Study-Group/files/
• Examples
  o Word2vec – pre-packaged deep learning software that can recognize similarities among words (e.g., countries in Europe) as well as how they relate to other words (countries and capitals)
  o AlchemyAPI – image recognition of common objects
http://www.youtube.com/watch?v=n1ViNeWhC24
Hadoop Connectors
• R: rmr2 allows MapReduce jobs from the R environment; bridges in-memory data and HDFS
  o Non-Hadoop R for big data: pbdR (programming with big data in R) allows R to use large HPC platforms with thousands of cores by providing an interface to MPI, NetCDF4, and more
• MongoDB and Hadoop: Mongo-Hadoop 1.1
• Pattern: migrating predictive models from SAS, MicroStrategy, SQL Server, etc. to Hadoop via PMML (an XML standard for predictive model markup)
• .NET MapReduce API for Hadoop
• Python for Hadoop
Python-Hadoop Options
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Python-Hadoop Benchmarks
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/