Date posted: 21-Jan-2017
Category: Data & Analytics
Uploaded by: datamatics-global-services-limited
Address Performance Issues – Reporting and Portal Latency
Implement Product Recommendation Engine
Implement highly scalable Big Data solution
RDBMS – Data Migration and Decommission
Implement Models for Predictive Analytics
Implement Visualization Platform – Standard Dashboards
Right Approach
Datamatics Solution
Why Is It Required?
Increased traffic with the help of discovery tools
Improved decision-making and better ROI
Deliver the right content to the right audience
Manage inventory and supply efficiently
Enhance customer experience
Quick turnaround time (TAT) for data analysis
Content-based and collaborative filtering
Datamatics Solution
Analytical (Reports)
• Historical data
• Non-volatile
• Fast retrievals
• Larger volumes of data
Operational (Real-time)
• Reads/writes/updates
• Current/recent data
• Fast inserts/updates
• Large volumes of data
Approach: A combination of both operational and analytical frameworks in a distributed environment, as opposed to a single-machine installation, would best suit the requirement.
Technology Stack: A hybrid solution that includes an integration of Hive and HBase.
Solution Requirements
Integration between HDFS/Hive and HBase
HBase: Row-level updates solve data duplication
• HBase is a scale-out table store
• HBase does not allow duplicate rows
• Supports a very high rate of row-level updates over massive data volumes
• Allows fast random reads and writes
• Keeps recently updated data in memory
• Incrementally rewrites data to new files
• Splits and merges intelligently based on distribution changes
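The update behavior described above (recent writes held in memory, incremental rewrites to new files, no duplicate rows) follows the log-structured merge pattern. A minimal Python sketch of the idea — illustrative only, not HBase's actual implementation:

```python
# Minimal LSM-style row store sketch: recent writes stay in an in-memory
# memtable; when it fills up it is flushed to an immutable segment file,
# and reads check the memtable first, then segments newest-first.
class LsmRowStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}          # row_key -> value (most recent writes)
        self.segments = []          # immutable flushed dicts, oldest first
        self.memtable_limit = memtable_limit

    def put(self, row_key, value):
        self.memtable[row_key] = value   # row-level update, never a duplicate row
        if len(self.memtable) >= self.memtable_limit:
            self.segments.append(dict(self.memtable))  # incremental rewrite
            self.memtable = {}

    def get(self, row_key):
        if row_key in self.memtable:
            return self.memtable[row_key]
        for segment in reversed(self.segments):  # newest segment wins
            if row_key in segment:
                return segment[row_key]
        return None

store = LsmRowStore()
store.put("user:1", {"clicks": 3})
store.put("user:2", {"clicks": 1})
store.put("user:1", {"clicks": 4})   # update overwrites, never duplicates
print(store.get("user:1"))           # {'clicks': 4}
```

Real HBase adds versioning, region splits, and compaction of segments, but the read/write path follows this shape.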
Hive: Solves high-volume and analytics needs
• Hive storage is based on Hadoop’s underlying append-only file-system architecture
• Ideal for capturing and analyzing streams of events
• Can store event information for hundreds of millions of users
HDFS:
• Handles high volumes of multi-structured data
• Highly scalable at very low cost
Challenges with HDFS/Hive and HBase
HBase:
• Right design: to ensure high throughput and low latency, the schema has to be designed correctly
Other challenges include:
• Data model
• Designing based on derived query patterns
• Row key design
• Manual splitting
• Monitoring compactions
Hive:
• Keeping the warehouse up to date is difficult
• Has an append-only constraint
• Impossible to directly apply individual updates to warehouse tables
• Periodically pulling snapshots of live data and dumping them to a new Hive partition is possible, but this is a costly operation
• Can be done at most once daily, which leaves the data stale
HDFS:
• Cannot handle a high velocity of random reads and writes
• Fast random reads require the data to be stored structured (ordered)
• Unable to change a file without completely rewriting it
• The only way to modify a stored file without rewriting it is appending
[Diagram: Hive + HBase + Storm + HDFS]
Lambda Architecture
• Lambda architecture is a modern big data architecture framework for distributed data processing that is fault-tolerant against both hardware failures and human errors.
• It also serves a wide range of workloads that need low-latency reads and writes at high throughput.
SERVING
QUERY
• Serving layer is used to index batch views
• Ad-hoc batch views can be queried with low latency
• Hive is used for batch views; HBase is used for real-time views
• Hive and HBase integration happens in this layer
• Batch views and real-time views are merged in this layer
• Merged views are implemented using HBase
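The serving-layer merge can be sketched with toy in-memory stand-ins for the Hive batch view and the HBase real-time view (the keys and counts are hypothetical):

```python
# Toy sketch of the Lambda serving-layer merge: a query combines the
# pre-computed batch view (Hive: complete but stale) with the real-time
# view (HBase: covering only events since the last batch run).
batch_view = {"page:home": 1000, "page:cart": 250}      # from the batch layer
realtime_view = {"page:home": 12, "page:checkout": 3}   # from the speed layer

def query(key):
    # Merged result = batch value + real-time delta for the same key.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page:home"))      # 1012
print(query("page:checkout"))  # 3
```

Once the next batch run completes, the batch view absorbs the recent events and the corresponding real-time entries are discarded.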
DATA STREAM
BATCH
Batch layer majorly handles two functions:
• Managing the master dataset (an append-only set of raw data)
• Pre-computing arbitrary query functions (batch views)
• Hadoop’s HDFS is used in this layer
SPEED
• Speed layer deals with recent data only
• The high latency of updates caused by the batch layer feeding the serving layer is addressed in the speed layer
• Storm is used in this layer
Proposed Architecture – Option 1 - Open Source
Features: MapR vs. Apache
Data Protection
• MapR: Complete snapshot recovery
• Apache: Inconsistent snapshot recovery
Security
• MapR: Encrypted transmission of cluster data; permissions checked on every file access
• Apache: Permissions for users are checked on file open only
Disaster Recovery
• MapR: Out-of-the-box disaster recovery services with simple mirroring
• Apache: No standard mirroring solution; scripts are hard to manage and administer
Enterprise Integration
• MapR: Minimal disruption to the ecosystem while integrating
• Apache: –
Performance
• MapR: DirectShuffle technology leverages performance advantages
• Apache: Apache Hadoop’s NFS cannot read or write to an open file
Scalable without single point of failure
• MapR: Clusters don’t use NameNodes and provide stateful high availability for the MapReduce JobTracker and Direct Access NFS
• Apache: Needs to be specially configured
Proposed Architecture – Option 2 - MapR Distribution
Data Collection
Input Data Processing (ETL, Filtering)
Recommendation Data Building (Mahout)
Loading Final Data to Serving Layer
Recommendation Serving Layer
Output Data Post-Processing (Re-ordering)
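The pipeline stages above can be sketched end-to-end in Python. The function names and the toy co-purchase logic are illustrative stand-ins, not the actual Mahout-based implementation:

```python
# Illustrative end-to-end flow for the recommendation pipeline stages.
def collect():                      # Data Collection
    return [("u1", "tv"), ("u1", "hdmi-cable"), ("u2", "tv"), ("u1", "tv")]

def etl(events):                    # Input Data Processing (dedupe / filter)
    return sorted(set(events))

def build_recommendations(events):  # Recommendation Data Building
    # Stand-in for Mahout: recommend items bought by other users
    # that this user has not bought yet.
    by_user = {}
    for user, item in events:
        by_user.setdefault(user, set()).add(item)
    recs = {}
    for user, items in by_user.items():
        others = set().union(*(i for u, i in by_user.items() if u != user))
        recs[user] = sorted(others - items)
    return recs

def serve(recs, user):              # Recommendation Serving Layer lookup
    return recs.get(user, [])

def post_process(items):            # Output Post-Processing (re-ordering)
    return list(reversed(items))

recs = build_recommendations(etl(collect()))
print(post_process(serve(recs, "u2")))  # ['hdmi-cable']
```

In the real system each stage is a distributed job (ETL on Hadoop, model building in Mahout, serving from HBase), but the data flow between stages is the same.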
Recommendation Engine – Architecture & Strategy
Content-based filtering:
• Extract meta-data from each item
• Normalize the meta-data into a feature vector
• Compute distances:
• Euclidean distance score
• Cosine similarity score
• Pearson correlation score
Collaborative filtering:
• Item-based recommendation
• User-based recommendation
Clustering:
• Group users into different clusters
• Find representative items for each cluster
Graph traversal:
• Highest bought
• Most liked
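The three distance/similarity scores listed above can be computed directly from two item feature vectors; a minimal sketch in plain Python:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two feature vectors (lower = more similar).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors (1.0 = same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def pearson_correlation(a, b):
    # Linear correlation of the two vectors (1.0 = perfect linear relationship).
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd = math.sqrt(sum((x - mean_a) ** 2 for x in a)) * \
         math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / sd

# Two item feature vectors (e.g. normalized meta-data):
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean_distance(a, b))   # ~3.742
print(cosine_similarity(a, b))    # 1.0 (same direction)
print(pearson_correlation(a, b))  # 1.0 (perfect linear relationship)
```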
Recommendation Engine – Strategies
Construct a co-occurrence matrix (product similarity matrix), S[NxN]
Personalized Recommendation
Based on collaborative filtering:
• Build a preference vector, P
• Multiply the two matrices: R = S × P
• Sort the elements of the final vector R
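The R = S × P step can be illustrated with a toy example (hypothetical items and co-occurrence counts; Mahout performs the same computation at scale on Hadoop):

```python
# Toy item-based CF: S is the NxN co-occurrence (product similarity) matrix,
# P the user's preference vector, and R = S x P the recommendation scores.
def matvec(S, P):
    # Multiply matrix S by column vector P.
    return [sum(s * p for s, p in zip(row, P)) for row in S]

items = ["tv", "hdmi-cable", "soundbar"]
# S[i][j] = how often items i and j were bought together.
S = [
    [0, 4, 3],   # tv
    [4, 0, 1],   # hdmi-cable
    [3, 1, 0],   # soundbar
]
P = [1, 0, 0]    # this user bought only a tv

R = matvec(S, P)                                      # [0, 4, 3]
ranked = sorted(zip(items, R), key=lambda t: -t[1])   # sort by score
print([item for item, score in ranked if score > 0])  # ['hdmi-cable', 'soundbar']
```

Items the user already owns score against their co-occurrence row, so the highest-scoring unowned items (here, accessories frequently bought with a TV) come out on top after sorting.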
ItemSimilarityJob
• Class to compute co-occurrence matrix
Algorithms
• Alternating Least Squares
• Singular Value Decomposition
Collaborative Filtering
RecommenderJob
• Main class to generate personalized recommendations
• Input file
• Similarities –
• CooccurrenceCountSimilarity
• TanimotoCoefficientSimilarity
• LogLikelihoodSimilarity
Role of Mahout
[P: Preference matrix, S: Similarity matrix, R: Recommendation matrix]