Date post: | 30-Jun-2015 |
Category: |
Technology |
Upload: | agayduk |
Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta
Alexey Gayduk, Roman Nikolaenko
E-commerce
● Amazon
● Walmart
● Target
● Macy’s
Intelligence
Product
What are the characteristics?
● Size
● Material
● Pocket Style
● Weave type
● Hem Style
● Cleaning
● Fit
Target: Wrangler Jeans - Women’s Bootcut Jeans
● Weave Type: Denim
● Pocket Style: 5 Pocket Pockets
● Cleaning: Machine Wash Cold
● Rise: Low Rise Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems
Walmart: Wrangler® Womens Bootcut Jean - Grey
● Weave Type: Denim
● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
● Fabric Care Instructions: Machine Wash,Tumble Dry
● Decorative Details: Top Stitching
● Fabric Content: Cotton,Spandex
What problems do we solve?
● Data capture
● Processing and storage
● Reports
Data Capture
Crawler
● Distributed: uses EC2 for crawling
● Failover: a failed node is replaced with another one
Data Capture
Crawler output (JSON):
{
  "Name": "Wrangler® Womens Bootcut Jean - Grey",
  "Weave Type": "Denim",
  "Pockets": "2 hip pockets, 1 watch pocket, 2 front scoop pockets",
  "Fabric Care Instructions": "Machine Wash,Tumble Dry",
  "Decorative Details": "Top Stitching",
  "Fabric Content": "Cotton,Spandex"
}
Processing and Storage
● Distributed data storage
● Distributed data processing
HDFS (Hadoop Distributed File System)
● Files are stored as blocks
● Write once, read many times
● Reliability by replication
● One central point of access to files (the NameNode)
http://dev-time.org/?p=893
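To make "files are stored as blocks" and "reliability by replication" concrete, here is a small arithmetic sketch. The 64 MB block size and the replication factor of 3 are assumptions for illustration (they match the classic Hadoop 1.x defaults, but a cluster may be configured differently), and the class name `HdfsBlockMath` is hypothetical. The code only does the math and does not talk to a real cluster.

```java
// Illustrative arithmetic only; this does not call any HDFS API.
final class HdfsBlockMath {
    // Assumed values for the example, not taken from the slides.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;
    static final int REPLICATION = 3;

    /** Number of blocks a file of the given length is split into (the last block may be partial). */
    static long blockCount(long fileBytes, long blockSize) {
        if (fileBytes < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileBytes must be >= 0 and blockSize > 0");
        }
        return (fileBytes + blockSize - 1) / blockSize; // ceiling division
    }

    /** Raw bytes stored across the cluster: every block is kept `replication` times. */
    static long rawBytes(long fileBytes, int replication) {
        if (fileBytes < 0 || replication <= 0) {
            throw new IllegalArgumentException("fileBytes must be >= 0 and replication > 0");
        }
        return fileBytes * replication;
    }
}
```

Under these assumptions, a 200 MB file of crawled data is split into four blocks (three full ones and a partial one) and occupies 600 MB of raw disk space across the cluster.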
Data Matching
How to match?
● Target: Wrangler Jeans - Women’s Bootcut Jeans
● Walmart: Wrangler® Womens Bootcut Jean - Grey
Data Matching
Katta (Lucene index storage)
● Distributed storage of Lucene indexes
● Makes serving large or high-load indices easy
● Failover
● Data replication
● Easy to scale
● Plays well with a Hadoop cluster
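When an index is distributed over several nodes, a query is typically sent to every shard and the partial results are merged into one ranked list. The sketch below shows only that merge step in plain Java. It is not Katta's or Lucene's API: the `Hit` type, the `mergeTopK` method, and the score values are hypothetical and exist only to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of merging per-shard search results; no Katta or Lucene calls.
final class ShardedSearchSketch {
    /** One search hit as a shard might return it (hypothetical shape). */
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Combines the hit lists of all shards and keeps the k best-scoring hits overall. */
    static List<Hit> mergeTopK(List<List<Hit>> perShardHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShardHits) {
            all.addAll(hits);
        }
        // Highest score first.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.max(0, Math.min(k, all.size()));
        return new ArrayList<>(all.subList(0, limit));
    }
}
```

Each shard only has to return its own best k hits, so the merge stays cheap even when the full index is large.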
It's time to...
Roman Nikolaenko
Data Flow
[Diagram components: Crawled Data (Target, Amazon, eBay, Sears, Walmart), "Human" Cloud, Hadoop, Data Storage for Reports]
Task Chain
Crawled record:
{ "Name":"Wrangler® Womens Bootcut Jean - Grey", "Weave Type":"Denim", "Pockets":"2 hip pockets, 1 watch pocket, 2 front scoop pockets", "Fabric Care Instructions":"Machine Wash,Tumble Dry", "Decorative Details":"Top Stitching", "Fabric Content":"Cotton,Spandex"}
Processed record:
{ "NAME":"Womens Bootcut Jean", "MFGR_NAME":"WRANGLER", "COLOR":"GREY", "WEAVE_TYPE":"DENIM", "POKETS_TYPES":"2 HIP|1 WATCH|2 FRONT SCOOP", "CARE_INSTRUCTIONS":"MACHINE WASH|TUMBLE DRY", "DECORATIVE_DETAILS":"Top Stitching", "CONTENT":"COTTON, SPANDEX", "FINGERPRINT":"Womens Bootcut Jean!WRANGLER", "FINGERPRINT_HASH":"3902152632", "MAS_PROD_ID":"72312"}
Task Chain
0. Load data to HDFS
1. Attribute Name Transformation
2. Attribute Values Normalization
3. Create Fingerprints
4. Make Mappings
5. Load data to Data Storage for Reports
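Steps 1 to 3 of the chain above can be sketched in plain Java as follows. This is a rough illustration, not the authors' implementation: the `NAME_MAP` table and the normalization rules are hypothetical, the real rules are clearly richer (the slides' example also extracts the manufacturer and color from the product name), and CRC32 is only an assumed hash function because the slides do not say how the fingerprint hash is computed.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch of task-chain steps 1 to 3; names and rules are assumptions.
final class TaskChainSketch {
    // Assumed source-to-canonical attribute names; the real mapping is not shown in the slides.
    static final Map<String, String> NAME_MAP = new LinkedHashMap<>();
    static {
        NAME_MAP.put("Weave Type", "WEAVE_TYPE");
        NAME_MAP.put("Fabric Care Instructions", "CARE_INSTRUCTIONS");
        NAME_MAP.put("Fabric Content", "CONTENT");
    }

    /** Step 1: rename known attributes; unknown names are upper-cased with underscores. */
    static Map<String, String> transformNames(Map<String, String> record) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            String fallback = e.getKey().trim().toUpperCase(Locale.ROOT).replace(' ', '_');
            out.put(NAME_MAP.getOrDefault(e.getKey(), fallback), e.getValue());
        }
        return out;
    }

    /** Step 2: trim, upper-case, and turn a comma-separated value into a pipe-separated one. */
    static String normalizeValue(String raw) {
        StringBuilder sb = new StringBuilder();
        for (String part : raw.split(",")) {
            String token = part.trim().toUpperCase(Locale.ROOT);
            if (token.isEmpty()) {
                continue; // skip empty pieces such as a trailing comma
            }
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    /** Step 3: fingerprint = name + "!" + manufacturer, the shape shown in the processed record. */
    static String fingerprint(String name, String manufacturer) {
        return name + "!" + manufacturer;
    }

    /** Unsigned 32-bit hash of the fingerprint as a decimal string (CRC32 is an assumption). */
    static String fingerprintHash(String fingerprint) {
        CRC32 crc = new CRC32();
        crc.update(fingerprint.getBytes(StandardCharsets.UTF_8));
        return Long.toString(crc.getValue());
    }
}
```

Two records from different retailers that normalize to the same fingerprint hash become candidates for the same master product in step 4, which is what makes a cheap equality lookup possible before any fuzzy matching.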
Task Chain
● Control
● Coordination
● Monitoring
● Comfort
Project Manager
[Diagram: a PROJECT consists of TASKs; each TASK has a Task Type, and each Task Type is identified by a service URL]
Task Type service URLs:
● http://example0.com/dataloader/service/
● http://example1.com/transformation/service/
● http://example2.com/normalization/service/
● http://example1.com/fingerprint/service/
● http://example3.com/lookup/service/
● http://example4.com/reportingLoader/service/
REST API (JSON as DTO)
● GET PARAMETERS of TASK
● START TASK
● GET STATUS of TASK
● STOP TASK
Task Type: http://example0.com/dataloader/service/
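The four operations above can be read as a small contract that every task type implements. The sketch below expresses that contract as a plain Java interface with an in-memory implementation. It leaves out HTTP and JSON entirely, and all names (`TaskService`, `TaskStatus`, `InMemoryTask`, the status values) are hypothetical, since the slides do not show the actual DTOs or status codes.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical status values; the slides do not list the real ones. */
enum TaskStatus { NEW, RUNNING, STOPPED, DONE }

/** The four operations from the slide as a plain Java interface. */
interface TaskService {
    Map<String, String> getParameters();        // GET PARAMETERS of TASK
    void start(Map<String, String> parameters); // START TASK
    TaskStatus getStatus();                     // GET STATUS of TASK
    void stop();                                // STOP TASK
}

/** In-memory stand-in for a task web service; it only tracks state and runs no real work. */
final class InMemoryTask implements TaskService {
    private final Map<String, String> defaults;
    private TaskStatus status = TaskStatus.NEW;

    InMemoryTask(Map<String, String> defaults) {
        this.defaults = new LinkedHashMap<>(defaults);
    }

    @Override
    public Map<String, String> getParameters() {
        return Collections.unmodifiableMap(defaults);
    }

    @Override
    public synchronized void start(Map<String, String> parameters) {
        if (status == TaskStatus.RUNNING) {
            throw new IllegalStateException("task is already running");
        }
        // A real service would launch the work here, for example by submitting a job.
        status = TaskStatus.RUNNING;
    }

    @Override
    public synchronized TaskStatus getStatus() {
        return status;
    }

    @Override
    public synchronized void stop() {
        if (status == TaskStatus.RUNNING) {
            status = TaskStatus.STOPPED;
        }
    }

    /** Called by the worker when the task finishes on its own. */
    synchronized void markDone() {
        if (status == TaskStatus.RUNNING) {
            status = TaskStatus.DONE;
        }
    }
}
```

Because every task type exposes the same four calls, a coordinator only needs the service URL of a task type to drive it, which fits the one-URL-per-task-type layout shown in the slides.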
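Project Manager
[Diagram: for each task (Task 1, Task 2), the Project Manager calls the Service URL of its Task Type; the Web Service behind that URL runs the work on Hadoop or as local processing]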
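Web Service: Start MapReduce
Job job = getMapReduceJob();
job.waitForCompletion(true);  // blocks until the job finishes and prints progress
OR
job.submit();                 // returns immediately; poll the job for its status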
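Web Service: MapReduce monitoring
job.isComplete();      // true once the job has finished, successfully or not
job.isSuccessful();    // true if the job finished without failing
job.mapProgress();     // fraction of map work done, 0.0 to 1.0
job.reduceProgress();  // fraction of reduce work done, 0.0 to 1.0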
Web Service: MapReduce & Third Party Libraries
// Force ':' as the classpath separator, presumably because the client may run on an OS that uses a different one
System.setProperty("path.separator", ":");
Configuration config = getConfig();
FileSystem fileSystem = getFS();
// Upload the library jar to HDFS, then add it to the job classpath
fileSystem.copyFromLocalFile(source, destination);
DistributedCache.addArchiveToClassPath(destination, config, fileSystem);
Job configuration file on cluster will contain:
mapred.cache.archives = hdfs://namenode.com:9000/distributedCache/gson-1.7.1.jar,...
mapred.job.classpath.archives = /distributedCache/gson-1.7.1.jar:...
Reporting Application
[Diagram components: Dashboard, Get Report REST API, Create Report REST API, Reports Data Storage Facade, MongoDB cluster]
http://spf13.com/post/mongodb-and-hadoop
MongoDB cluster
CLIENT_546 collection:
...
{"report_type":"PROD COUNT", "snapshot_time":"2011-01-15", "AMAZON":"500", "TARGET":"300", "WALMART":"900"}
...
{"report_type":"PRICE COMPARE", "MAS_PROD_ID":"72312", "snapshot_time":"2011-02-15", "NAME":"Womens Bootcut Jean", "MFGR_NAME":"WRANGLER", "AMAZON_PRICE":"50", "TARGET_PRICE":"45", "WALMART_PRICE":"55"}
...
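A row like the PROD COUNT document above could be produced by counting products per retailer. The following sketch shows one way to build such a row as a map that a JSON library or a MongoDB driver could then serialize. The class and method names are hypothetical, and the input (one retailer name per matched product) is an assumption about how the data arrives.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of building a PROD COUNT report row; no MongoDB or Hadoop calls.
final class ProdCountReportSketch {
    /** Counts products per retailer and returns a row shaped like the PROD COUNT document. */
    static Map<String, String> prodCountRow(String snapshotTime, List<String> retailerOfEachProduct) {
        // TreeMap keeps the retailer columns in alphabetical order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String retailer : retailerOfEachProduct) {
            counts.merge(retailer, 1, Integer::sum);
        }
        Map<String, String> row = new LinkedHashMap<>();
        row.put("report_type", "PROD COUNT");
        row.put("snapshot_time", snapshotTime);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // The example document stores counts as strings, so this sketch does the same.
            row.put(e.getKey(), String.valueOf(e.getValue()));
        }
        return row;
    }
}
```

Keeping each report row as a flat document with one field per retailer means a new retailer only adds a field, with no schema change in the collection.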
UI For Client
[Diagram: the UI sends a JSON query to the Reports Data Storage Facade and gets a JSON report back; Hadoop reducers send JSON report rows to the Reports Data Storage Facade]
Contacts
Alexey Gayduk: [email protected], oleksiy_gayduk, http://www.linkedin.com/pub/alexey-gayduk/4/39b/a31
Roman Nikolaenko: [email protected], roman_jd_nikolaenko, http://ua.linkedin.com/pub/roman-nikolaienko/2b/413/431