Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Posted on 30-Jun-2015


Description

Would you like to hear about a project built on a stack of Hadoop for distributed data processing and storage, Katta for distributed storage and serving of Lucene indexes, and MongoDB for storing unstructured data? We would like to share our real-world experience with this combination: which problems we ran into and how we solved them. Take, for example, using third-party libraries in Hadoop Map/Reduce: it all seems obvious, but how do you do it cleanly and conveniently? Or how do you start a Hadoop job from a web application rather than from the console, and monitor its execution? Then there is the problem of storing and processing unstructured data in MySQL: what data did we keep there, and why did we decide to use MongoDB? And why do we use Katta at all? All of these problems and their solutions grew out of a real business idea, and we will tell you about all of it.

Transcript

Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Alexey Gayduk, Roman Nikolaenko

E-commerce

● Amazon
● Walmart
● Target
● Macy’s

Intelligence

Product

What are the characteristics?

● Size
● Material
● Pocket Style
● Weave type
● Hem Style
● Cleaning
● Fit

Target: Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket
● Cleaning: Machine Wash Cold
● Rise: Low Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems


Target: Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket
● Cleaning: Machine Wash Cold
● Rise: Low Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems

Walmart: Wrangler® Womens Bootcut Jean - Grey

● Weave Type: Denim
● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
● Fabric Care Instructions: Machine Wash, Tumble Dry
● Decorative Details: Top Stitching
● Fabric Content: Cotton, Spandex

What problems do we solve?

● Data capture
● Processing and storage
● Reports

Data Capture

Crawler

● Distributed: uses EC2 for crawling
● Failover: a failed node will be replaced with another

Data Capture

{
  "Name": "Wrangler® Womens Bootcut Jean - Grey",
  "Weave Type": "Denim",
  "Pockets": "2 hip pockets, 1 watch pocket, 2 front scoop pockets",
  "Fabric Care Instructions": "Machine Wash,Tumble Dry",
  "Decorative Details": "Top Stitching",
  "Fabric Content": "Cotton,Spandex"
}

JSON

Processing and Storage

● Distributed data storage
● Distributed data processing

● Files are stored as blocks
● Write once, read many times
● Reliability by replication
● One central point of access to files

HDFS (Hadoop Distributed File System)

http://dev-time.org/?p=893
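The block-and-replication model above can be made concrete with a little storage arithmetic. This is an illustrative sketch, not from the talk; the file size is hypothetical and 64 MB is the classic default HDFS block size:

```java
// Back-of-the-envelope HDFS storage math (illustrative values).
public class HdfsBlockMath {
    // Number of blocks a file occupies: ceil(fileSize / blockSize).
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster once replication is applied.
    static long replicatedBytes(long fileSizeBytes, int replicationFactor) {
        return fileSizeBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long file = 200L * 1024 * 1024;  // a 200 MB crawled-data file
        long block = 64L * 1024 * 1024;  // classic 64 MB HDFS block
        System.out.println(numBlocks(file, block));     // 4 blocks
        System.out.println(numBlocks(file, block) * 3); // 12 block replicas at replication 3
    }
}
```

So a 200 MB file occupies 4 blocks, and with the usual replication factor of 3 the cluster holds 12 block replicas of it.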

Data Matching

Target: Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket
● Cleaning: Machine Wash Cold
● Rise: Low Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems

Walmart: Wrangler® Womens Bootcut Jean - Grey

● Weave Type: Denim
● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
● Fabric Care Instructions: Machine Wash, Tumble Dry
● Decorative Details: Top Stitching
● Fabric Content: Cotton, Spandex

How to match?

Data Matching

Katta (Lucene index storage)

● Distributed storage of Lucene indexes
● Makes serving large or high-load indexes easy
● Failover
● Data replication
● Easy to scale
● Plays well with a Hadoop cluster
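As a toy illustration of the matching idea: two retailers' records match when their normalized fingerprints are equal. The talk performs this lookup against a distributed Lucene index served by Katta; the in-memory map below only sketches the concept, and the method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Toy product matcher: records from different retailers match when their
// fingerprints (name + "!" + uppercased manufacturer) are equal.
public class FingerprintMatcher {
    private final Map<String, String> fingerprintToMasterId = new HashMap<>();

    static String fingerprint(String name, String manufacturer) {
        return name.trim() + "!" + manufacturer.trim().toUpperCase();
    }

    void index(String name, String manufacturer, String masterProductId) {
        fingerprintToMasterId.put(fingerprint(name, manufacturer), masterProductId);
    }

    // Returns the master product id, or null when no match exists.
    String match(String name, String manufacturer) {
        return fingerprintToMasterId.get(fingerprint(name, manufacturer));
    }

    public static void main(String[] args) {
        FingerprintMatcher m = new FingerprintMatcher();
        m.index("Womens Bootcut Jean", "WRANGLER", "72312");            // Walmart record
        System.out.println(m.match("Womens Bootcut Jean", "Wrangler")); // Target record, prints 72312
    }
}
```

In the real system the index is sharded and replicated by Katta, so the same lookup scales to millions of products and survives node failures.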

It's time to...

Roman Nikolaenko

Crawled Data: Target, Amazon, eBay, Sears, Walmart

Data Flow

Crawled Data → "Human" Cloud → Data Storage for Reports

Crawled Data → Hadoop → Data Storage for Reports

Task Chain

{
  "Name": "Wrangler® Womens Bootcut Jean - Grey",
  "Weave Type": "Denim",
  "Pockets": "2 hip pockets, 1 watch pocket, 2 front scoop pockets",
  "Fabric Care Instructions": "Machine Wash,Tumble Dry",
  "Decorative Details": "Top Stitching",
  "Fabric Content": "Cotton,Spandex"
}

{
  "NAME": "Womens Bootcut Jean",
  "MFGR_NAME": "WRANGLER",
  "COLOR": "GREY",
  "WEAVE_TYPE": "DENIM",
  "POKETS_TYPES": "2 HIP|1 WATCH|2 FRONT SCOOP",
  "CARE_INSTRUCTIONS": "MACHINE WASH|TUMBLE DRY",
  "DECORATIVE_DETAILS": "Top Stitching",
  "CONTENT": "COTTON, SPANDEX",
  "FINGERPRINT": "Womens Bootcut Jean!WRANGLER",
  "FINGERPRINT_HASH": "3902152632",
  "MAS_PROD_ID": "72312"
}
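The deck does not say which hash produces FINGERPRINT_HASH; as one plausible sketch, a 32-bit checksum such as CRC32 yields unsigned values in the same numeric range as "3902152632":

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hashing a product fingerprint to a compact numeric key.
// CRC32 is used here purely as an example of a 32-bit hash; the talk does not
// name its actual hash function.
public class FingerprintHash {
    static long fingerprintHash(String fingerprint) {
        CRC32 crc = new CRC32();
        crc.update(fingerprint.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // unsigned 32-bit value, returned as a long
    }

    public static void main(String[] args) {
        System.out.println(fingerprintHash("Womens Bootcut Jean!WRANGLER"));
    }
}
```

A fixed-width numeric key like this is cheap to compare and index, which is why the normalized record carries both the fingerprint string and its hash.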

Task Chain

0. Load data to HDFS
1. Attribute Name Transformation
2. Attribute Values Normalization
3. Create Fingerprints
4. Make Mappings
5. Load data to Data Storage for Reports
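The steps above run strictly in order, and a failed step must stop the chain. A minimal sketch of that coordination, with a hypothetical TaskService stub standing in for the real per-task-type REST services:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal task-chain runner: execute steps in order, abort on first failure.
// TaskService is a hypothetical stand-in for the per-task-type web services.
public class TaskChain {
    interface TaskService {
        boolean run(String taskName); // true = task finished successfully
    }

    static List<String> runChain(List<String> steps, TaskService service) {
        List<String> completed = new ArrayList<>();
        for (String step : steps) {
            if (!service.run(step)) {
                break; // later steps depend on earlier ones, so stop here
            }
            completed.add(step);
        }
        return completed;
    }

    public static void main(String[] args) {
        List<String> steps = List.of(
            "Load data to HDFS", "Attribute Name Transformation",
            "Attribute Values Normalization", "Create Fingerprints",
            "Make Mappings", "Load data to Data Storage for Reports");
        // Stub that fails on the fingerprint step, to show the abort behavior.
        TaskService stub = step -> !step.equals("Create Fingerprints");
        System.out.println(runChain(steps, stub)); // the first three steps complete, then the chain stops
    }
}
```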

Task Chain

● Control
● Coordination
● Monitoring
● Comfort

Project Manager

A PROJECT consists of Tasks, and each Task references a Task Type.

Each Task Type points to a service URL:

● http://example0.com/dataloader/service/
● http://example2.com/normalization/service/
● http://example1.com/fingerprint/service/
● http://example3.com/lookup/service/
● http://example4.com/reportingLoader/service/
● http://example1.com/transformation/service/

REST API (JSON as DTO):

● GET PARAMETERS of TASK
● START TASK
● GET STATUS of TASK
● STOP TASK

TASK_TYPE: http://example0.com/dataloader/service/
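A minimal sketch of what one such task web service could look like, using only the JDK's built-in HTTP server; the endpoint path and the JSON payload here are assumptions, not taken from the talk:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of a task-type web service exposing a status operation over HTTP
// with a JSON DTO, as the slide describes.
public class TaskServiceSketch {
    static HttpServer start() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/task/status", exchange -> {
            byte[] body = "{\"status\":\"RUNNING\",\"mapProgress\":0.42}"
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start();
        System.out.println("listening on port " + server.getAddress().getPort());
        server.stop(0);
    }
}
```

Start, stop, and get-parameters would be further contexts on the same server; a JSON DTO keeps the project manager decoupled from each service's implementation.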

Several Tasks can share one Task Type (and its REST API): http://example0.com/dataloader/service/

Project Manager: a Task resolves its Task Type to a Service URL; the Web Service behind that URL runs either a Hadoop job or local processing.

Web Service: Start MapReduce

Job job = getMapReduceJob();
job.waitForCompletion(true); // blocking: submit and wait

// or, non-blocking:
job.submit();

Web Service: MapReduce monitoring

job.isComplete();

job.isSuccessful();

job.mapProgress();

job.reduceProgress();

System.setProperty("path.separator", ":");
Configuration config = getConfig();
FileSystem fileSystem = getFS();
fileSystem.copyFromLocalFile(source, destination);
DistributedCache.addArchiveToClassPath(destination, config, fileSystem);

Web Service: MapReduce & Third Party Libraries

The job configuration file on the cluster will contain:

mapred.cache.archives = hdfs://namenode.com:9000/distributedCache/gson-1.7.1.jar,...
mapred.job.classpath.archives = /distributedCache/gson-1.7.1.jar:...

Web Service: MapReduce & Third Party Libraries

Reporting Application

● Dashboard
● Get Report REST API
● Reports Data Storage Facade
● Create Report REST API
● MongoDB cluster

http://spf13.com/post/mongodb-and-hadoop

MongoDB cluster

CLIENT_546 collection:

...
{
  "report_type": "PROD COUNT",
  "snapshot_time": "2011-01-15",
  "AMAZON": "500",
  "TARGET": "300",
  "WALMART": "900"
}
...
{
  "report_type": "PRICE COMPARE",
  "MAS_PROD_ID": "72312",
  "snapshot_time": "2011-02-15",
  "NAME": "Womens Bootcut Jean",
  "MFGR_NAME": "WRANGLER",
  "AMAZON_PRICE": "50",
  "TARGET_PRICE": "45",
  "WALMART_PRICE": "55"
}
...
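The facade's job is essentially "JSON query in, matching report documents out". A sketch of that behavior with documents modeled as plain maps instead of a real MongoDB collection (class and method names are hypothetical; field names follow the sample documents above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the Reports Data Storage Facade idea: filter report documents
// by the key/value pairs of a JSON-like query.
public class ReportFacade {
    private final List<Map<String, String>> collection = new ArrayList<>();

    void insert(Map<String, String> doc) {
        collection.add(doc);
    }

    // Return all documents containing every key/value pair in the query.
    List<Map<String, String>> find(Map<String, String> query) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> doc : collection) {
            if (doc.entrySet().containsAll(query.entrySet())) {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ReportFacade facade = new ReportFacade();
        facade.insert(Map.of("report_type", "PROD COUNT", "snapshot_time", "2011-01-15",
                "AMAZON", "500", "TARGET", "300", "WALMART", "900"));
        facade.insert(Map.of("report_type", "PRICE COMPARE", "MAS_PROD_ID", "72312",
                "snapshot_time", "2011-02-15", "WALMART_PRICE", "55"));
        System.out.println(facade.find(Map.of("report_type", "PROD COUNT")).size()); // 1
    }
}
```

MongoDB's schemaless documents fit this workload because each report type carries its own set of fields, which is awkward to express as fixed MySQL columns.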

UI For Client → (JSON query / JSON report) → Reports Data Storage Facade

Hadoop Reducers → JSON report rows → Reports Data Storage Facade

Contacts

Alexey Gayduk
gayduk.a.s.ua@gmail.com
oleksiy_gayduk
http://www.linkedin.com/pub/alexey-gayduk/4/39b/a31

Roman Nikolaenko
sage.nrs@gmail.com
roman_jd_nikolaenko
http://ua.linkedin.com/pub/roman-nikolaienko/2b/413/431