Date posted: 07-Aug-2015
Category: Data & Analytics
Uploaded by: rohit-kulkarni
Scaling up with Hadoop
& Parallel Processing Framework (Banyan)
LatentView Analytics
2015
Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
LatentView Analytics
• Developed solutions for 25 Fortune 500 firms
• Over 500 people strong
• More than 1000 years of combined experience in Analytics
• Engaging 30K followers on social media
LatentView in the News
LatentView won the Deloitte Technology Fast 50 India awards for 6 consecutive years (2009 – 13)
‘Top Innovator’ awarded to LatentView by Developer Week (Conference & Festival 2013)
LatentView was a Top Finalist in the ‘We Love Our Workplace 2013’ category, reflecting global recognition of our workplace culture.
LatentView is an Advanced Consulting Partner with Amazon Web Services
Services provided by LatentView
• Build Reporting and Analytics Centers of Excellence (COEs)
• Analyze business problems both qualitatively & quantitatively and provide actionable insights
• Onsite-offshore global delivery model that helps in-house teams do more with less
• Provide thought leadership in Data Science

LatentView is an Alliance Partner with Tableau
Industry-Specific Analysis: Market Basket Analysis, Campaign Analytics, Fraud Detection, Survey Analytics, Customer Lifetime Value, Demand Forecasting, Price Optimization, Social Media Analysis
Work @ LatentView

Different data sources & formats: Mobile, PC, Tablet, Signal & Wireless Data, Servers & Cloud, Social, User Profile, Surveys & Reviews, Travel & Location, Performance, System Logs & Database Data, Unstructured Data

Technology & predictive analysis toolkits: Data Engineering & Advanced Analytics, Infrastructure, Databases, Predictive Modelling, CXO Dashboards & Visualization
Data Processing Frameworks

Distributed Processing characteristics:
• Master/Slave or peer-to-peer architecture
• Data replication and redundancy
• Fault tolerant, shared memory
• Centralized job distribution
• Efficient job scheduling
• Coordinated resource management
• Processes structured & semi-structured data
• Examples: Hadoop, Spark, Storm

Parallel Processing characteristics:
• Shared-nothing, massively parallel architecture
• Common or independent storage
• Independent memory & processor space
• Random job distribution
• Self-managed resources & workers
• Dynamically load-balanced cluster
• Processes unstructured data
• Example: Banyan
The Search Context

SMART (information retrieval system)
• Gerard Salton, father of modern search technology
• Salton’s Magic Automatic Retriever of Text
• Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values

Project Xanadu
• Ted Nelson coined the term Hypertext
• Create a computer network with a simple UI to solve social problems like attribution
• Inspired the creation of the WWW

ARPANet
• Advanced Research Projects Agency Network
• Led to the Internet
• First implementation of the TCP/IP stack

Archie Query Form
• Document search & find tool
• Script-based data gatherer with a regular expression matcher for retrieving files
• A database of web filenames which it would match with users’ queries

FTP & WWW
• Enter Tim Berners-Lee
• httpd, TCP, DNS – connected it all

Ask, AltaVista, Yahoo, Google, Bing
What is the biggest problem that the Search Engines of the last two decades solve?
Project Lucene was written by Doug Cutting in 1999, purely in Java.
It was written with the intention of helping create an open-source web search engine.
Lucene is just an indexing and search library; it does not contain crawling or HTML-parsing functionality.
Building Lucene
Ported Nutch Algorithms to Hadoop
Yahoo Hires Doug Cutting!
Apache Hadoop comes into the picture to support MapReduce & HDFS
Yahoo’s Grid team adopts Hadoop
Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours
Algorithms in Hadoop
Yahoo! set up a Hadoop research cluster—300 nodes. Also, Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark).
Research cluster upgraded to 600 nodes
In 2008 - Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster
Hadoop Benchmarks!
As of 2008 - Loading 10 terabytes of data per day on to research clusters
17 clusters with total of 24,000 nodes
In 2009 – Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 TB sort in 173 minutes (on 3,400 nodes)
Last.fm, Facebook, New York Times
Hadoop In Action!
Lucene could not crawl or parse HTML by itself, so a sub-project called Nutch was developed under it.
Doug Cutting & Mike Cafarella
Highly modular architecture that allows developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
Building Nutch
Google File System paper was presented
NDFS was developed based on the paper
Google released another paper, on MapReduce, that revolutionized Hadoop development
MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This is known as Data Locality
The Heart of Hadoop – Distributed File System & Map Reduce
A Brief History of Hadoop – Key Challenges Addressed

• 1999 – Building Lucene: efficient indexing of the results for easy retrieval
• 2002 – Building Nutch: efficient crawling of the World Wide Web at scale
• 2003 – File System
• 2005 – MapReduce
• 2006 – Algorithms in Hadoop: cost- and time-efficient hardware and software
• 2008 – Hadoop in Action: distributed processing, data warehousing & analysis; UI & tools
Hadoop Services Ecosystem

Hadoop Distributions: Apache Hadoop, Apache Bigtop
Hadoop as a Platform: Cloudera, HortonWorks, MapR
Hadoop as a Service: GoGrid, Qubole, Altiscale, AWS EMR, Azure HDInsight, IBM BigInsights
Elephant though it may be, it has its limitations!
• Master/Slave architecture
• Batch vs real-time stream processing
• Relational database vs NoSQL database
• Data fragmentation and management
• Parallel processing job requirements
• Efficient energy management
Applications of Big Data Processing
Predict Galaxy types and shapes
Analyzing Life Forms
Weather Forecast
Traffic Management
Disaster Recovery
Personal Health Care
Categories: Science & Engineering, Environmental Management, Intelligent Devices & IoT
Apache Hadoop & Family
Identify the Apache Hadoop Components!
The Apache Hadoop Stack
• Hadoop Distributed File System (HDFS)
• YARN / MapReduce v2
• Pig (scripting), Hive (SQL)
• Mahout (machine learning), Oozie (workflow)
• HBase (columnar data store)
• Flume (log collection), Sqoop (data exchange)
• ZooKeeper (coordination)
• Hadoop User Experience (HUE)
Walking the Talk with Hadoop – Let’s Architect…
People you may know on LinkedIn. You might know me if people that you know, know me!
foreach u in UserList:
    foreach x in Connections(u):
        foreach y in Connections(x):
            if (y not in Connections(u)):
                Count(u, y)++
    Sort y in descending order of Count(u, y)
    Choose top 3 y
    Store (u, {y0, y1, y2..}) for serving
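The pseudocode above can be sketched as runnable Python; the connection graph and user names here are hypothetical toy data, not a real social graph:

```python
from collections import Counter

# Toy connection graph (hypothetical data for illustration)
connections = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}

def people_you_may_know(user, top_n=3):
    """Count second-degree contacts who are not already connections."""
    counts = Counter()
    for x in connections[user]:
        for y in connections[x]:
            if y != user and y not in connections[user]:
                counts[y] += 1
    return [y for y, _ in counts.most_common(top_n)]

print(people_you_may_know("alice"))  # ['dave']: reachable via both bob and carol
```

At LinkedIn scale the same triple loop is what gets split across a MapReduce cluster, since each user's candidate counting is independent of every other user's.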
Simplest-ever MapReduce example

What is a Map? What is a Reduce?

A Mapper is a function that transforms the input data into the required format, without aggregating.
Mapped_List = Mapper(Input_List)

Ex:
Input_List = (1, 2, 3, 4, 5, 6, 7, 8, 9)
Mapper = Square()
Mapped_List = Square(Input_List)
Mapped_List = Square(1, 2, 3, 4, 5, 6, 7, 8, 9)
Mapped_List = (1, 4, 9, 16, 25, 36, 49, 64, 81)

A Reducer is a function that aggregates the input data into the required format.
Output_List = Reducer(Mapped_List)

Ex:
Mapped_List = (1, 4, 9, 16, 25, 36, 49, 64, 81)
Reducer = Sum()
Output_List = Sum(Mapped_List)
Output_List = Sum(1, 4, 9, 16, 25, 36, 49, 64, 81)
Output_List = 285
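The Square/Sum example maps directly onto Python's built-in `map` and `functools.reduce`; a minimal sketch:

```python
from functools import reduce

input_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Map: transform each element independently, no aggregation
mapped_list = list(map(lambda x: x * x, input_list))

# Reduce: fold the mapped list down to a single aggregate value
output = reduce(lambda acc, x: acc + x, mapped_list, 0)

print(mapped_list)  # [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(output)       # 285
```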
Characteristics of MapReduce
• Map is an inherently parallel process: each list element is processed independently.
• Reduce is inherently sequential, unless multiple lists are processed at a time, in parallel.
• Grouping is done to produce multiple lists and so exploit parallelism.

Input → Partition → Map → Sort → Shuffle → Reduce → Output
Native MapReduce, Hadoop Streaming

Simulating MapReduce
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
What do each of above commands do? What is the output?
Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)
Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 19:09:01 ip-10-218-136-14 CRON[21427]: pam_unix(cron:session): session opened for user root by (uid=0)
mc:~$ cat /var/log/auth.log* | grep "session opened" | less
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
28321 root
   86 ubuntu
47635 user
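The same pipeline can be expressed in the Hadoop Streaming style, where a mapper emits tab-separated key/value lines and a reducer aggregates sorted input. This is an illustrative in-process sketch (the sample lines mirror the auth.log excerpt above); in a real streaming job the two functions would run as separate scripts reading stdin:

```python
from itertools import groupby

def mapper(lines):
    """Emit 'user<TAB>1' for each 'session opened' line (the grep + cut steps)."""
    for line in lines:
        if "session opened" in line:
            # 11th whitespace-separated field is the username, as in cut -f11
            yield line.split()[10] + "\t1"

def reducer(pairs):
    """Sum counts per user; assumes input sorted by key (the sort | uniq -c steps)."""
    split_pairs = (p.split("\t") for p in pairs)
    for user, group in groupby(split_pairs, key=lambda kv: kv[0]):
        yield f"{user}\t{sum(int(count) for _, count in group)}"

# Sample lines mirroring the auth.log excerpt above
sample = [
    "Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)",
    "Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)",
    "Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)",
]
for out in reducer(sorted(mapper(sample))):
    print(out)  # counts per user: root 2, ubuntu 1
```

The `sorted()` between mapper and reducer stands in for the shuffle phase, which is exactly what the `sort` command does in the shell simulation.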
The MapReduce Process

Map
  In: (Key1, Value1) → Out: List(Key2, Value2)
  Filtering, transformation

Shuffle
  In: (Key2, Value2) → Out: Sort(Partition(Key2, List(Value2)))
  Movement / copy of data

Reduce
  In: List(Key2, List(Value2)) → Out: List(Key3, Value3)
  Aggregation
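The three phases and their key/value signatures can be simulated in a few lines of Python; the word-count records below are hypothetical illustration data:

```python
from itertools import groupby

def map_phase(records):
    # (Key1, Value1) -> List(Key2, Value2): filtering / transformation
    for _, line in records:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # (Key2, Value2) -> Sort(Partition(Key2, List(Value2))): group values by key
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (key, [v for _, v in group])

def reduce_phase(grouped):
    # List(Key2, List(Value2)) -> List(Key3, Value3): aggregation
    for key, values in grouped:
        yield (key, sum(values))

records = [(1, "big data big clusters"), (2, "big ideas")]
print(list(reduce_phase(shuffle_phase(map_phase(records)))))
# [('big', 3), ('clusters', 1), ('data', 1), ('ideas', 1)]
```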
The MapReduce Process with a Deck of Cards!
Map in Parallel → Shuffle/Group → Reduce (a Sum() applied to each group)
Hadoop Security

Authentication: Provides Kerberos-based authentication. Kerberos can be connected to corporate LDAP environments to centrally provision user information.

Authorization: Ensures users have access only to data permitted by corporate policies. Provides fine-grained authorization via file permissions in HDFS and resource-level access control for YARN & MapReduce.

Data Protection: Supports encrypting data in transit and at rest, plus masking capabilities for desensitizing PII.

Audit: Centralized framework for collecting access audit history and easy reporting on the data.

Centralized Security Administration: Security requirements are applied consistently across the platform and can be managed centrally through a single interface.
Difference between Authentication & Authorization ?
A Brief Note on Spark & Storm

What do you think is the most time-consuming aspect of Hadoop processes?

Spark: How to improve the I/O limitation? Keep intermediate data in memory instead of writing it to disk between stages. Result: faster analytics.

Storm: How to achieve event-driven real-time analytics? Process data as a continuous stream of events rather than in batches. Result: highly customized service response.
Hadoop Distributions & Service Providers
Data Deluge – A Big Problem

Sources: PC, Tablet, Mobile, Social, Search & E-Commerce
Tree-Based Unstructured Feature Extraction

Inputs (panel & web logs, social):
• Tweets • Comments • Likes • Shares • Blogs • Reviews
• Clickstream • HTML • Images • Audio* • Video*

Components: Rules Engine, Data Parser
Feature     Type    Detail
Feature 1   Image   600*400
Feature 2   Link    #
Feature 3   Price   $200
Feature 4   Star    3.5

Tweet     Time    View
Tweet 1   12:00   Positive
Tweet 2   12:05   Neutral

Tree-Based Parser
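A tree-based parser of this kind can be sketched with Python's standard-library `html.parser`, which walks the HTML tag tree; the snippet and the tags/attributes it looks for are illustrative assumptions, not Banyan's actual implementation:

```python
from html.parser import HTMLParser

class FeatureExtractor(HTMLParser):
    """Walk the HTML tag tree and collect simple (type, detail) features."""
    def __init__(self):
        super().__init__()
        self.features = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "width" in attrs and "height" in attrs:
            self.features.append(("Image", f'{attrs["width"]}*{attrs["height"]}'))
        elif tag == "a" and "href" in attrs:
            self.features.append(("Link", attrs["href"]))

# Hypothetical product-page snippet
html = '<div><img src="p.jpg" width="600" height="400"><a href="#">Buy</a></div>'
parser = FeatureExtractor()
parser.feed(html)
print(parser.features)  # [('Image', '600*400'), ('Link', '#')]
```

A rules engine would sit on top of such a parser, deciding which extracted features (prices, ratings, links) matter for a given page type.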
Banyan – Parallel processing framework at scale
• Is your data unstructured ? Ex: HTML, Images, URLs, Audio, Video, Documents, Text
• Is processing each input independent of processing other input? Ex: Compressing one image is independent of next image
• Do you need to solve the two problems above at web scale? Ex: say, 1 million documents to be processed in less than 1 hour
We handle what Hadoop can’t handle!
Rather, We handle what Hadoop isn’t supposed to handle – Parallel Processing & Unstructured data!
Banyan is a parallel processing framework well integrated with cloud platform of your choice!
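The "embarrassingly parallel" pattern Banyan targets can be illustrated with a minimal sketch (not Banyan's actual API): each input is handled by an independent task, so workers share nothing and the work fans out trivially:

```python
from concurrent.futures import ThreadPoolExecutor

def process_document(doc):
    """Stand-in for an independent per-input task (e.g. compressing one image
    or parsing one HTML page); no task needs the output of any other task."""
    return len(doc.split())

docs = ["first small document", "second doc", "third and final document"]

# Because inputs are independent, the map fans out across workers; a real
# framework would use separate processes or machines rather than threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_document, docs))

print(results)  # [3, 2, 4]
```

Note there is no shuffle or reduce step here at all: that absence is exactly what distinguishes this workload from the distributed-processing jobs Hadoop is built for.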
Follow us here: Banyan – Embarrassingly Parallel Processing Framework LinkedIn Group
http://www.growbanyan.com
Email us : [email protected]
Banyan vs Hadoop (Yes/No comparison)

Characteristic                  Banyan                               Hadoop
Job Type                        Embarrassingly Parallel Processing   Distributed Processing
Master/Slave Architecture
Shared Nothing Architecture
Data Replication
Fault Tolerance
Coordinated Job Distribution
Dynamic Load Balancing
Rescheduling Job Failures
Processes Structured Data
Processes Unstructured Data

Note: The core advantage of Banyan is best utilized when data processing & analysis (aggregation) are executed in a decoupled fashion, for jobs that can be processed in parallel.