Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Posted on 30-Jun-2015


Description

Would you like to hear about a project built on a stack of Hadoop for distributed data processing and storage, Katta for distributed storage and serving of Lucene indexes, and MongoDB for storing unstructured data? We would like to share our real-world experience with this combination: which problems we ran into and how we solved them. Take, for example, using third-party libraries in Hadoop Map/Reduce: it all seems obvious, but how do you do it cleanly and conveniently? Or how do you start a Hadoop job from a web application rather than from the console, and monitor its execution? Then there is the problem of storing and processing unstructured data in MySQL: what data did we keep there, and why did we decide to use MongoDB? And why do we use Katta at all? All of these problems and their solutions grew out of a real business idea, and we will tell you about all of it.

Transcript

Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Alexey Gayduk, Roman Nikolaenko

E-commerce

● Amazon
● Walmart
● Target
● Macy’s

Intelligence

Product

What are the characteristics?

● Size
● Material
● Pocket Style
● Weave type
● Hem Style
● Cleaning
● Fit

Target: Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket
● Cleaning: Machine Wash Cold
● Rise: Low Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems


Target: Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket
● Cleaning: Machine Wash Cold
● Rise: Low Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems

Walmart: Wrangler® Womens Bootcut Jean - Grey

● Weave Type: Denim
● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
● Fabric Care Instructions: Machine Wash, Tumble Dry
● Decorative Details: Top Stitching
● Fabric Content: Cotton, Spandex

What problems do we solve?

● Data capture
● Processing and storage
● Reports

Data Capture

Crawler

● Distributed: uses EC2 for crawling
● Failover: a failed node will be replaced with another

Data Capture

{
  "Name": "Wrangler® Womens Bootcut Jean - Grey",
  "Weave Type": "Denim",
  "Pockets": "2 hip pockets, 1 watch pocket, 2 front scoop pockets",
  "Fabric Care Instructions": "Machine Wash,Tumble Dry",
  "Decorative Details": "Top Stitching",
  "Fabric Content": "Cotton,Spandex"
}

JSON

Processing and Storage

● Distributed data storage
● Distributed data processing

● Files are stored as blocks
● Write once, read many times
● Reliability by replication
● One central point of access to files

HDFS (Hadoop Distributed File System)

http://dev-time.org/?p=893
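The block-and-replication model above can be made concrete with a little storage arithmetic. This is an illustrative sketch, not from the talk; the file size is hypothetical and 64 MB is the classic default HDFS block size:

```java
// Back-of-the-envelope HDFS storage math (illustrative values).
public class HdfsBlockMath {
    // Number of blocks a file occupies: ceil(fileSize / blockSize).
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster once replication is applied.
    static long replicatedBytes(long fileSizeBytes, int replicationFactor) {
        return fileSizeBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long file = 200L * 1024 * 1024;  // a 200 MB crawled-data file
        long block = 64L * 1024 * 1024;  // classic 64 MB HDFS block
        System.out.println(numBlocks(file, block));     // 4 blocks
        System.out.println(numBlocks(file, block) * 3); // 12 block replicas at replication 3
    }
}
```

So a 200 MB file occupies 4 blocks, and with the usual replication factor of 3 the cluster holds 12 block replicas of it.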

Data Matching

Target: Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket
● Cleaning: Machine Wash Cold
● Rise: Low Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems

Walmart: Wrangler® Womens Bootcut Jean - Grey

● Weave Type: Denim
● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
● Fabric Care Instructions: Machine Wash, Tumble Dry
● Decorative Details: Top Stitching
● Fabric Content: Cotton, Spandex

How to match?

Data Matching

Katta (Lucene index storage)

● Distributed storage of Lucene indexes
● Makes serving large or high-load indexes easy
● Failover
● Data replication
● Easy to scale
● Plays well with a Hadoop cluster
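As a toy illustration of the matching idea: two retailers' records match when their normalized fingerprints are equal. The talk performs this lookup against a distributed Lucene index served by Katta; the in-memory map below only sketches the concept, and the method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Toy product matcher: records from different retailers match when their
// fingerprints (name + "!" + uppercased manufacturer) are equal.
public class FingerprintMatcher {
    private final Map<String, String> fingerprintToMasterId = new HashMap<>();

    static String fingerprint(String name, String manufacturer) {
        return name.trim() + "!" + manufacturer.trim().toUpperCase();
    }

    void index(String name, String manufacturer, String masterProductId) {
        fingerprintToMasterId.put(fingerprint(name, manufacturer), masterProductId);
    }

    // Returns the master product id, or null when no match exists.
    String match(String name, String manufacturer) {
        return fingerprintToMasterId.get(fingerprint(name, manufacturer));
    }

    public static void main(String[] args) {
        FingerprintMatcher m = new FingerprintMatcher();
        m.index("Womens Bootcut Jean", "WRANGLER", "72312");            // Walmart record
        System.out.println(m.match("Womens Bootcut Jean", "Wrangler")); // Target record, prints 72312
    }
}
```

In the real system the index is sharded and replicated by Katta, so the same lookup scales to millions of products and survives node failures.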

It's time to...

Roman Nikolaenko

Crawled Data: Target, Amazon, eBay, Sears, Walmart

Data Flow

Crawled Data → "Human" Cloud → Data Storage for Reports

Crawled Data → Hadoop → Data Storage for Reports

Task Chain

{
  "Name": "Wrangler® Womens Bootcut Jean - Grey",
  "Weave Type": "Denim",
  "Pockets": "2 hip pockets, 1 watch pocket, 2 front scoop pockets",
  "Fabric Care Instructions": "Machine Wash,Tumble Dry",
  "Decorative Details": "Top Stitching",
  "Fabric Content": "Cotton,Spandex"
}

{
  "NAME": "Womens Bootcut Jean",
  "MFGR_NAME": "WRANGLER",
  "COLOR": "GREY",
  "WEAVE_TYPE": "DENIM",
  "POKETS_TYPES": "2 HIP|1 WATCH|2 FRONT SCOOP",
  "CARE_INSTRUCTIONS": "MACHINE WASH|TUMBLE DRY",
  "DECORATIVE_DETAILS": "Top Stitching",
  "CONTENT": "COTTON, SPANDEX",
  "FINGERPRINT": "Womens Bootcut Jean!WRANGLER",
  "FINGERPRINT_HASH": "3902152632",
  "MAS_PROD_ID": "72312"
}
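The deck does not say which hash produces FINGERPRINT_HASH; as one plausible sketch, a 32-bit checksum such as CRC32 yields unsigned values in the same numeric range as "3902152632":

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hashing a product fingerprint to a compact numeric key.
// CRC32 is used here purely as an example of a 32-bit hash; the talk does not
// name its actual hash function.
public class FingerprintHash {
    static long fingerprintHash(String fingerprint) {
        CRC32 crc = new CRC32();
        crc.update(fingerprint.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // unsigned 32-bit value, returned as a long
    }

    public static void main(String[] args) {
        System.out.println(fingerprintHash("Womens Bootcut Jean!WRANGLER"));
    }
}
```

A fixed-width numeric key like this is cheap to compare and index, which is why the normalized record carries both the fingerprint string and its hash.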

Task Chain

0. Load data to HDFS
1. Attribute Name Transformation
2. Attribute Values Normalization
3. Create Fingerprints
4. Make Mappings
5. Load data to Data Storage for Reports
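The steps above run strictly in order, and a failed step must stop the chain. A minimal sketch of that coordination, with a hypothetical TaskService stub standing in for the real per-task-type REST services:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal task-chain runner: execute steps in order, abort on first failure.
// TaskService is a hypothetical stand-in for the per-task-type web services.
public class TaskChain {
    interface TaskService {
        boolean run(String taskName); // true = task finished successfully
    }

    static List<String> runChain(List<String> steps, TaskService service) {
        List<String> completed = new ArrayList<>();
        for (String step : steps) {
            if (!service.run(step)) {
                break; // later steps depend on earlier ones, so stop here
            }
            completed.add(step);
        }
        return completed;
    }

    public static void main(String[] args) {
        List<String> steps = List.of(
            "Load data to HDFS", "Attribute Name Transformation",
            "Attribute Values Normalization", "Create Fingerprints",
            "Make Mappings", "Load data to Data Storage for Reports");
        // Stub that fails on the fingerprint step, to show the abort behavior.
        TaskService stub = step -> !step.equals("Create Fingerprints");
        System.out.println(runChain(steps, stub)); // the first three steps complete, then the chain stops
    }
}
```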

Task Chain

● Control
● Coordination
● Monitoring
● Comfort

Project Manager

A PROJECT consists of Tasks, and each Task references a Task Type.

Each Task Type points to a service URL:

● http://example0.com/dataloader/service/
● http://example2.com/normalization/service/
● http://example1.com/fingerprint/service/
● http://example3.com/lookup/service/
● http://example4.com/reportingLoader/service/
● http://example1.com/transformation/service/

REST API (JSON as DTO):

● GET PARAMETERS of TASK
● START TASK
● GET STATUS of TASK
● STOP TASK

TASK_TYPE: http://example0.com/dataloader/service/
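A minimal sketch of what one such task web service could look like, using only the JDK's built-in HTTP server; the endpoint path and the JSON payload here are assumptions, not taken from the talk:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of a task-type web service exposing a status operation over HTTP
// with a JSON DTO, as the slide describes.
public class TaskServiceSketch {
    static HttpServer start() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/task/status", exchange -> {
            byte[] body = "{\"status\":\"RUNNING\",\"mapProgress\":0.42}"
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start();
        System.out.println("listening on port " + server.getAddress().getPort());
        server.stop(0);
    }
}
```

Start, stop, and get-parameters would be further contexts on the same server; a JSON DTO keeps the project manager decoupled from each service's implementation.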

Several Tasks can share one Task Type (and its REST API): http://example0.com/dataloader/service/

Project Manager: a Task resolves its Task Type to a Service URL; the Web Service behind that URL runs either a Hadoop job or local processing.

Web Service: Start MapReduce

Job job = getMapReduceJob();
job.waitForCompletion(true); // blocking: submit and wait

// or, non-blocking:
job.submit();

Web Service: MapReduce monitoring

job.isComplete();

job.isSuccessful();

job.mapProgress();

job.reduceProgress();

System.setProperty("path.separator", ":");
Configuration config = getConfig();
FileSystem fileSystem = getFS();
fileSystem.copyFromLocalFile(source, destination);
DistributedCache.addArchiveToClassPath(destination, config, fileSystem);

Web Service: MapReduce & Third Party Libraries

The job configuration file on the cluster will contain:

mapred.cache.archives = hdfs://namenode.com:9000/distributedCache/gson-1.7.1.jar,...
mapred.job.classpath.archives = /distributedCache/gson-1.7.1.jar:...

Web Service: MapReduce & Third Party Libraries

Reporting Application

● Dashboard
● Get Report REST API
● Reports Data Storage Facade
● Create Report REST API
● MongoDB cluster

http://spf13.com/post/mongodb-and-hadoop

MongoDB cluster

CLIENT_546 collection:

...
{
  "report_type": "PROD COUNT",
  "snapshot_time": "2011-01-15",
  "AMAZON": "500",
  "TARGET": "300",
  "WALMART": "900"
}
...
{
  "report_type": "PRICE COMPARE",
  "MAS_PROD_ID": "72312",
  "snapshot_time": "2011-02-15",
  "NAME": "Womens Bootcut Jean",
  "MFGR_NAME": "WRANGLER",
  "AMAZON_PRICE": "50",
  "TARGET_PRICE": "45",
  "WALMART_PRICE": "55"
}
...
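The facade's job is essentially "JSON query in, matching report documents out". A sketch of that behavior with documents modeled as plain maps instead of a real MongoDB collection (class and method names are hypothetical; field names follow the sample documents above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the Reports Data Storage Facade idea: filter report documents
// by the key/value pairs of a JSON-like query.
public class ReportFacade {
    private final List<Map<String, String>> collection = new ArrayList<>();

    void insert(Map<String, String> doc) {
        collection.add(doc);
    }

    // Return all documents containing every key/value pair in the query.
    List<Map<String, String>> find(Map<String, String> query) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> doc : collection) {
            if (doc.entrySet().containsAll(query.entrySet())) {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ReportFacade facade = new ReportFacade();
        facade.insert(Map.of("report_type", "PROD COUNT", "snapshot_time", "2011-01-15",
                "AMAZON", "500", "TARGET", "300", "WALMART", "900"));
        facade.insert(Map.of("report_type", "PRICE COMPARE", "MAS_PROD_ID", "72312",
                "snapshot_time", "2011-02-15", "WALMART_PRICE", "55"));
        System.out.println(facade.find(Map.of("report_type", "PROD COUNT")).size()); // 1
    }
}
```

MongoDB's schemaless documents fit this workload because each report type carries its own set of fields, which is awkward to express as fixed MySQL columns.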

UI For Client → (JSON query / JSON report) → Reports Data Storage Facade

Hadoop Reducers → JSON report rows → Reports Data Storage Facade

Contacts

Alexey Gayduk
gayduk.a.s.ua@gmail.com
oleksiy_gayduk
http://www.linkedin.com/pub/alexey-gayduk/4/39b/a31

Roman Nikolaenko
sage.nrs@gmail.com
roman_jd_nikolaenko
http://ua.linkedin.com/pub/roman-nikolaienko/2b/413/431