Date post: | 10-Nov-2014 |
One Catalog Service to rule them all
Antoine Girbal, Principal Solutions Engineer, MongoDB Inc., @antoinegirbal
Problem Statement
3
The many catalogs problem
4
1. One department, in charge of the master product catalog, works hard at fitting data into SQL tables
2. The resulting data sits in a SQL server with a couple of replicas. It is forbidden to hit it more than 100 times / sec
3. Other departments need to access the data far more often for their own services
4. Other departments need more information, which is not available because it did not fit the long-devised, rigid SQL schema
5. ETLs and message buses are put in place so other teams can try to figure it out themselves…
6. Data becomes inconsistent, fragmented, and out of date… The problem is visible both internally and to customers!
The many catalogs problem
5
How many Catalogs and Catalog Caches do you have?
6
The many catalogs problem
[Diagram: the Product Department's Master Catalog is copied, via a Message Bus and ETLs, into separate catalogs for the Online Store, Marketing, and Departments 1, 3, 4 and 5.]
Dozens of catalogs!
7
• Single view of a product, one central catalog service
• Flexible schema containing all useful data
• Read volume high and sustained, 100k reads / s
• Can seamlessly take write spikes during catalog update
• Advanced indexing and querying
• Geographical distribution for HA and low latency
Goal: Single View of Product
8
1. MongoDB Overview
2. Catalog Service Architecture
3. Data Store Models
4. Product Search
Agenda
MongoDB Overview
10
• Holds complex JSON structures
• Dynamic Schema for Agility
• Complex querying and in-place updating
• Secondary, compound and geo indexing
• Full consistency, durability, atomic operations
• HA and geo-distributed via Replication
• Near linear scaling via Sharding
• Overall, MongoDB is a unique fit!
MongoDB is a great fit
11
MongoDB Strategic Advantages
Horizontally Scalable - Sharding
Agile, Flexible
High Performance & Strong Consistency
Application
Highly Available - Replica Sets
{ customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }
12
build your data to fit your application
Relational:

Customer ID  First Name  Last Name  City
0            John        Doe        New York
1            Mark        Smith      San Francisco
2            Jay         Black      Newark
3            Meagan      White      London
4            Edward      Danields   Boston

Order Number  Store ID  Product       Customer ID
10            100       Tablet        0
11            101       Smartphone    0
12            101       Dishwasher    0
13            200       Sofa          1
14            200       Coffee table  1
15            201       Suit          2

MongoDB:

{ customer_id: 1,
  name: "Mark Smith",
  city: "San Francisco",
  orders: [
    { order_number: 13,
      store_id: 10,
      date: "2014-01-03",
      products: [
        { SKU: 24578234, Qty: 3, Unit_price: 350 },
        { SKU: 98762345, Qty: 1, Unit_price: 110 }
      ]
    },
    { <...> }
  ]
}
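The mapping from relational rows to a nested document can be sketched in plain JavaScript. This is only an illustration: the `embedOrders` helper and the lowercase field names are assumptions, and the rows are a subset of the two tables on this slide.

```javascript
// Rows from the relational tables on this slide (subset).
const customers = [
  { customer_id: 0, name: "John Doe", city: "New York" },
  { customer_id: 1, name: "Mark Smith", city: "San Francisco" },
];
const orders = [
  { order_number: 10, store_id: 100, product: "Tablet", customer_id: 0 },
  { order_number: 13, store_id: 200, product: "Sofa", customer_id: 1 },
  { order_number: 14, store_id: 200, product: "Coffee table", customer_id: 1 },
];

// Hypothetical helper: nest each customer's orders inside the customer
// document, so a single read returns what the application needs.
function embedOrders(customerRows, orderRows) {
  return customerRows.map((c) => ({
    ...c,
    orders: orderRows
      .filter((o) => o.customer_id === c.customer_id)
      .map(({ customer_id, ...rest }) => rest), // drop the foreign key
  }));
}

const docs = embedOrders(customers, orders);
console.log(JSON.stringify(docs[1], null, 2));
```

The foreign key disappears because the relationship is now expressed by nesting rather than by a join.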
13
Notions
RDBMS MongoDB
Database Database
Table Collection
Row Document
Column Field
Catalog Service Architecture
15
Architecture Overview
[Diagram; recoverable labels listed below]
Information Management: Merchandising, Content, Inventory, Customer, Channel, Sales & Fulfillment, Insight, Social
Customer: Channels (Amazon, Ebay, …); Stores (POS, Kiosk, …); Mobile (Smartphone, Tablet); Website; Contact Center; Social (Facebook, Twitter, …)
API, Data and Service Integration
Data Warehouse, Analytics
Supply Chain Management System
Suppliers: 3rd Party, In Network
Web Servers, Application Servers
16
Commerce Functional Components
[Diagram; recoverable labels listed below]
Information Layer: Look & Feel, Navigation, Customization, Personalization, Branding, Promotions, Chat, Ads
Customer's Perspective: Research (Browse, Search); Select (Shopping Cart); Purchase (Checkout); Receive (Track); Use (Feedback, Maintain); Dialog (Assist)
Market / Offer: Guide, Offer, Semantic Search, Recommend, Rule-based Decisions, Pricing, Coupons
Sell / Fulfill: Orders, Payments, Fraud Detection, Fulfillment, Business Rules
Insight: Session Capture, Activity Monitoring
Customer Enterprise
Information Management: Merchandising, Content, Inventory, Customer, Channel, Sales & Fulfillment, Insight, Social
17
Merchandising Components
Merchandising
MongoDB
Variant
Hierarchy
Pricing
Promotions
Ratings & Reviews
Calendar
Semantic Search
Item
Localization
19
MongoDB Data Store
Merchandising - Architecture
Items, Pricing, Promotions, Variants, Ratings & Reviews
Search Engine
Product Service API
…
Online Store Marketing Inventory SCMS Public API …
Data Store Models
21
Models - Product Page
Product images
General Information
List of Variants
External Information
Localized Description
22
• Item: the overall product info (e.g. Levi’s 501)
• Variant: a specific variant of an item (e.g. in black size 6) which typically has a specific SKU / UPC
• Price: price information may vary based on the store, the variant, etc
• Hierarchy: the item taxonomy
• Facet: facets to search products by
• Vendors: a given sku may be available through several vendors if the site is a marketplace
> Don't try to fit all in the same document!
Models - Overview
23
Hundreds of sizes
One Item
Dozens of colors
Models – Overview
24
• A single item may have thousands of variants
• Each variant can have hundreds of attributes
• Altogether a single item can represent many MBs worth of JSON text
• Don't try to fit everything into the same document!
• Use a schema that is natural and fits the API
Models - Overview
25
{
  "_id": "054VA72303012P",  // the item id
  "desc": [                 // item descriptions
    { "lang": "en", "val": "Give your dressy look a lift with ..." },
    ...
  ],
  "name": "Women's Kate Ivory Peep-Toe Stiletto Heel",
  "category": "/84700/80009/1282094266/1200003270",  // hierarchy
  "brand": { "id": "2483510", "img": "http://...", "name": "Metaphor" },
  "assets": {               // references to all assets
    "imgs": [
      { "img": { "width": 1900, "height": 1900, "src": "http://..." } },
      ...
    ]
  },
  "shipping": {
    // shipping specs
  },
  "specs": {
    // item specs
  },
  "attrs": [                // list of item attributes (facets)
    { "name": "Heel Height", "value": "High (2-1/2 to 4 in.)" },
    { "name": "Toe", "value": "Open toe" },
    ...
  ],
  "variants": {             // quick info on the variants
    "cnt": 9,
    "attrs": [
      { "dispType": "DROPDOWN", "name": "Color" },
      { "dispType": "DROPDOWN", "name": "Shoe Size" },
      ...
    ]
  },
  "lastUpdated": 1400877254787  // keep track of updates
}
Models - Item Model
26
• Get item by id
db.definition.findOne( { _id: "301671" } )
• Get items from list of ids
db.definition.find( { _id: { $in: ["301671", "301672"] } } )
• Get items by department
db.definition.find({ category: { $regex: "^/84700/" } })
• Get items by category prefix
db.definition.find( { category: { $regex: "^/84700/80009/" } } )
• Secondary Indices
name, category, lastUpdated
Models - Item Model
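The category-prefix queries above rely on an anchored regular expression over the materialized path. A minimal sketch of how such a filter might be built safely is shown below; `categoryPrefixFilter` and `escapeRegex` are hypothetical helper names, not part of any driver.

```javascript
// Escape regex metacharacters so a category id is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build the filter for "all items under this category
// path". The "^" anchor keeps it a prefix match, and the trailing "/" stops
// "/84700/80009" from also matching "/84700/800091".
function categoryPrefixFilter(path) {
  const prefix = path.endsWith("/") ? path : path + "/";
  return { category: { $regex: "^" + escapeRegex(prefix) } };
}

const filter = categoryPrefixFilter("/84700/80009");
console.log(filter);

// Check the pattern locally against sample category paths.
const re = new RegExp(filter.category.$regex);
console.log(re.test("/84700/80009/1282094266/1200003270"));
console.log(re.test("/84700/800091/1"));
```

The resulting object can be passed as-is to `db.definition.find(...)`. Anchoring the pattern at the start is what allows the secondary index on `category` to be used for the prefix.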
27
{ "_id": "05458452563", // the sku
"name": "Width:Medium,Color:Ivory,Shoe Size:6.5",
"itemId": "054VA72303012P", // reference to the item id
"altIds": { "upc": "632576103580" },
"assets": { // list of assets specific to variant
"imgs": [
{ "width": 1900, "height": 1900, "src": "http://..." },
{ "width": 1900, "height": 1900, "src": "http://..." }, ...
]
},
"attrs": [ // list of attributes specific to variant
{ "name": "Width", "value": "Medium" },
{ "name": "Color", "family": "White", "value": "Ivory" },
{ "name": "Size", "value": "6.5" }, ...
],
"lastUpdated": 1400877254787 // keep track of updates
}
Models – Variant Model
28
• Get variant from SKU
db.variant.find( { _id: "05458452563" } )
• Get all variants for a product, sorted by SKU
db.variant.find( { itemId: "054VA72303012P" } ).sort( { _id: 1 } )
• Indices
itemId, lastUpdated
Models – Variant Model
29
Models - Hierarchy
{
"_id": "1200003270", // the node id
"name": "Women's Heels & Pumps",
"count": 22305, // how many items in this category
"parents": [ // list of parents
"1282094266"
],
"facets": [ // facets that exist for this category
"Heel Height",
"Toe",
"Upper Material",
"Width",
"Shoe Size",
"Color"
]
}
30
• Get hierarchy node by id
db.hierarchy.find( { _id: "1200003270" } )
• Get child nodes of a given parent id
db.hierarchy.find( { parents: "1282094266" } )
• Get departments (no parent)
db.hierarchy.find( { parents: null } )
• Secondary Indices
parents
Models – Hierarchy
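The item's `category` field is a path made of hierarchy node ids. One way such a path might be derived from the `parents` arrays is sketched below. The `categoryPath` helper is hypothetical, and the parent chain above "1282094266" is inferred from the item's category path rather than stated on the slides.

```javascript
// Sample hierarchy nodes shaped like the document on this slide.
// Only the last node's name and parent are given in the source; the rest of
// the chain is assumed from the path "/84700/80009/1282094266/1200003270".
const nodes = new Map([
  ["84700", { _id: "84700", parents: null }], // a department: no parent
  ["80009", { _id: "80009", parents: ["84700"] }],
  ["1282094266", { _id: "1282094266", parents: ["80009"] }],
  ["1200003270", { _id: "1200003270", name: "Women's Heels & Pumps", parents: ["1282094266"] }],
]);

// Hypothetical helper: follow the first parent of each node up to the
// department and join the ids into a materialized path.
function categoryPath(nodeId, nodeMap) {
  const ids = [];
  const seen = new Set();
  let current = nodeMap.get(nodeId);
  while (current && !seen.has(current._id)) {
    seen.add(current._id); // guard against cycles in bad data
    ids.unshift(current._id);
    const parentId = current.parents ? current.parents[0] : undefined;
    current = parentId ? nodeMap.get(parentId) : undefined;
  }
  return "/" + ids.join("/");
}

const path = categoryPath("1200003270", nodes);
console.log(path);
```

Nodes with several parents would yield several paths; this sketch follows only the first parent.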
31
Per-store pricing could result in billions of documents, unless it is built in a modular way:
_id: concatenation of item and store, in the form "<item>/<store>".
Item: can be an item id or a variant id (SKU).
Store: can be a store group (e.g. online) or a store id.
Models – per Store Pricing
{ "_id": "skuSPM8824542513_1234/store123", "price": 69.99, "sale": { "salePrice": 42.72, "saleEndDate": "2050-12-31 23:59:59" }, "lastUpdated": 1374647707394 }
32
• Get all prices for a given item
db.prices.find( { _id: /^item301671/ } )
• Get all prices for a given sku (price could be at item level)
db.prices.find( { _id: { $in: [ /^sku730223104376/, /^item301671/ ] } } )
• Get minimum and maximum prices for a sku
// the $match stage reuses the sku prefix from the previous query
db.prices.aggregate( [ { $match: { _id: /^sku730223104376/ } },
    { $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } } ] )
• Get price for a sku and store id (returns up to 4 prices)
db.prices.find( { _id: { $in: [ "sku730223104376/store1234",
    "sku730223104376/sgroup0",
    "item301671/store1234",
    "item301671/sgroup0" ] } }, { price: 1 } )
Models – per store Pricing
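Since the last query can return up to four price documents, the application has to pick one. The sketch below shows one possible resolution rule. The precedence order (SKU before item, store before store group) is an assumption, as the slides only list the four keys, and `priceKeys` and `resolvePrice` are hypothetical helpers.

```javascript
// Candidate _ids for one SKU in one store, most specific first.
// The precedence order is an assumption; the slides only list the four keys.
function priceKeys(sku, itemId, storeId, storeGroup) {
  return [
    `sku${sku}/store${storeId}`,
    `sku${sku}/sgroup${storeGroup}`,
    `item${itemId}/store${storeId}`,
    `item${itemId}/sgroup${storeGroup}`,
  ];
}

// Pick the first candidate present among the documents returned by
// db.prices.find( { _id: { $in: keys } } ).
function resolvePrice(keys, priceDocs) {
  const byId = new Map(priceDocs.map((d) => [d._id, d]));
  for (const key of keys) {
    if (byId.has(key)) return byId.get(key);
  }
  return null;
}

const keys = priceKeys("730223104376", "301671", "1234", "0");
// Hypothetical query result: an item-level group price and a store override.
const found = [
  { _id: "item301671/sgroup0", price: 69.99 },
  { _id: "item301671/store1234", price: 64.99 },
];
const resolved = resolvePrice(keys, found);
console.log(resolved);
```

With this layout, a price set once at item and store-group level covers every SKU and store that has no override, which is what keeps the collection far below one document per SKU per store.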
Product Search
34
Search – Browse and Search products
Browse by category
Special Lists
Filter by attributes
Lists hundreds of item summaries
By far the toughest page to get right and fast …
35
The previous page presents many challenges:
• Response within milliseconds for hundreds of items
• Faceted search on many attributes: category, brand, …
• Efficient sorting on several attributes: price, popularity
• Pagination feature which requires deterministic ordering
> Search engines are built for this purpose!
Search – Browse and Search products
36
Search – Traditional Architecture
[Diagram; recoverable labels listed below]
Product Data Store → (Indexing) → Product Search
Application → Product Search: #1 obtain search result IDs
Application → Cache: #2 obtain objects by ID from cache or DB
Cache: pre-joined into objects
37
The traditional architecture issues:
• 3 different systems to maintain: RDBMS, Search engine, Caching layer
• RDBMS schema is complex and static
• Applications need to speak many languages
Search – Traditional Architecture
38
Search – Architecture with MongoDB
[Diagram; recoverable labels listed below]
MongoDB (Product Data Store): ready-to-use product documents
Search Engine (Product Search), fed by Indexing
Product API:
#1 obtain search result IDs
#2 obtain objects by list of IDs
Applications: the application issues a single query
39
Search - Mongo-Connector
[Diagram] MongoDB (Oplog) → Mongo Connector → Search Engine
#1 Initial dump of the collections
#2 Updates streaming via Oplog
Translation, filtering
Indexing
40
• Open-source Project at https://github.com/10gen-labs/mongo-connector
• Python app that reads from MongoDB's oplog and publishes to target of choice
• Supports initial sync by dumping the data
• Default connectors for Solr, Elasticsearch, and another MongoDB cluster
• Easily extensible to update other systems like SQL
Search - Mongo-Connector
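To make the oplog-streaming idea concrete, here is a small sketch of applying oplog-style entries to a target index. It is not mongo-connector's code (which is a Python application). The `MemoryIndex` class and `applyEntry` function are hypothetical, and the entry shape is simplified to the `op`, `ns`, `o` and `o2` fields, with updates limited to `$set` or full replacement.

```javascript
// Minimal in-memory stand-in for a search index (hypothetical).
class MemoryIndex {
  constructor() {
    this.docs = new Map();
  }
  upsert(doc) {
    this.docs.set(doc._id, doc);
  }
  remove(id) {
    this.docs.delete(id);
  }
}

// Apply one oplog-style entry to the target. Only the followed namespace is
// processed; entries for other collections are filtered out.
function applyEntry(entry, target, namespace) {
  if (entry.ns !== namespace) return false;
  switch (entry.op) {
    case "i": // insert: `o` is the full document
      target.upsert(entry.o);
      return true;
    case "u": { // update: `o2` identifies the document, `o` describes the change
      const id = entry.o2._id;
      const current = target.docs.get(id) || { _id: id };
      const next = entry.o.$set
        ? { ...current, ...entry.o.$set }
        : { ...entry.o, _id: id };
      target.upsert(next);
      return true;
    }
    case "d": // delete: `o` holds the _id
      target.remove(entry.o._id);
      return true;
    default:
      return false; // no-ops, commands, etc.
  }
}

const index = new MemoryIndex();
const entries = [
  { op: "i", ns: "catalog.summary", o: { _id: "A1", name: "Shoe" } },
  { op: "u", ns: "catalog.summary", o2: { _id: "A1" }, o: { $set: { price: 42.72 } } },
  { op: "i", ns: "catalog.other", o: { _id: "B1" } },
];
for (const e of entries) applyEntry(e, index, "catalog.summary");
console.log(index.docs.get("A1"));
```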
41
What is the data to index?
Search – Mongo-Connector
42
Search – More Searching
Images of the matching variants are displayed
Facets for variants
Price and Rating
43
… more challenges:
• Attributes at the variant level: color, size, etc
• Attributes from other docs: pricing, ratings, etc
• Display the matching variant's image and details
• Thousands of matching variants for an item, still need to display a single item
• Challenge to properly index the data
> Need for a single summary document per item
Search – More Searching
44
MongoDB Data Store
Search - Architecture
SummariesItems Pricing
PromotionsVariantsRatings & Reviews
45
{
  "_id": "3ZZVA46759401P",  // the item id
  "name": "Women's Chic - Black Velvet Suede",
  "dep": "84700",           // useful as standalone for indexing
  "cat": "/84700/80009/1282094266/1200003270",
  "desc": { "lang": "en", "val": "This pointy toe slingback ..." },
  "img": { "width": 450, "height": 330, "src": "http://..." },
  "attrs": [                // global attributes, easily indexable by SE
    "heel height=mid (1-3/4 to 2-1/4 in.)",
    "brand=metaphor",
    "shoe size=6",
    "shoe size=6.5",
    ...
  ],
  "sattrs": [               // global attributes, not to be indexed
    "upper material=synthetic",
    "toe=open toe",
    ...
  ],
  "vars": [
    {
      "id": "05497884001",
      "img": [
        // images
      ],
      "attrs": [
        // list of variant attributes to index
      ],
      "sattrs": [
        // list of variant attributes not to index
      ]
    },
    …
  ]
}
Search – Summary Model
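A summary like the one above could be assembled from an item and its variants roughly as follows. This is a sketch under assumptions: `buildSummary` and `toTerm` are hypothetical helpers, and the split between indexed (`attrs`) and non-indexed (`sattrs`) attributes is left out because the slides do not say how it is decided.

```javascript
// { name: "Heel Height", value: "High (2-1/2 to 4 in.)" }
//   -> "heel height=high (2-1/2 to 4 in.)"
function toTerm(attr) {
  return `${attr.name}=${attr.value}`.toLowerCase();
}

// Hypothetical helper: build one summary document per item.
function buildSummary(item, variants) {
  return {
    _id: item._id,
    name: item.name,
    cat: item.category,
    dep: item.category.split("/")[1], // first path segment is the department
    attrs: item.attrs.map(toTerm),
    vars: variants.map((v) => ({ id: v._id, attrs: v.attrs.map(toTerm) })),
  };
}

// Sample data taken from the item and variant models earlier in the deck.
const item = {
  _id: "054VA72303012P",
  name: "Women's Kate Ivory Peep-Toe Stiletto Heel",
  category: "/84700/80009/1282094266/1200003270",
  attrs: [
    { name: "Heel Height", value: "High (2-1/2 to 4 in.)" },
    { name: "Toe", value: "Open toe" },
  ],
};
const variants = [
  {
    _id: "05458452563",
    attrs: [
      { name: "Color", value: "Ivory" },
      { name: "Size", value: "6.5" },
    ],
  },
];

const summary = buildSummary(item, variants);
console.log(JSON.stringify(summary, null, 2));
```

Encoding each attribute as a single "name=value" string is what lets one multi-valued field serve every facet.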
46
Let's use Solr …
Search – Using Solr
47
Search - Using Solr
48
Search - Using Solr
Defining the schema in schema.xml
<fields>
  <!-- some of the core fields -->
  <field name="_id" type="string" indexed="true" stored="true" />
  <field name="name" type="text_general" indexed="true" stored="true" />
  <field name="cat" type="string" indexed="true" stored="true" />
  <field name="price" type="float" indexed="true" stored="true"/>
  <!-- the full text to index -->
  <field name="desc.0.val" type="text_general" indexed="true" stored="true"/>
  <!-- dynamic attributes for faceting -->
  <dynamicField name="attrs.*" type="string" indexed="true" stored="true"/>
  <!-- some Solr specific fields -->
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
  <dynamicField name="*" type="ignored" multiValued="true"/>
</fields>
49
Search - Using Solr
Starting up the connector
> Keep it running, it will just stream the Oplog
> mongo-connector
    -m ec2-54-80-63-229.compute-1.amazonaws.com:27017      // the mongo
    -t http://localhost:8983/solr                           // the solr
    -d mongo_connector/doc_managers/solr_doc_manager.py
    -n "catalog.summary"                                    // target summary collection
    --auto-commit-interval=60                               // commit every 1 min
    …
50
Document in Solr looks like:
Lists are flattened, which makes them difficult to use
> Must use named fields to implement facets
Search – Using Solr
{
  "desc.0.val": "Our classic \"Flying Duck\" styled as a ...",
  "name": "Drake Waterfowl Duck Label SS T-Shirt Army Green",
  "attrs.1": "brand=Drake Waterfowl",
  "attrs.0": "style=t-shirts",
  "cat": "/84700/1200000239/1282094207/1200000817",
  "_id": "SPM10823491916",
  "_version_": 1479173524477182000,
  "timestamp": "2014-09-13T23:09:59.782Z"
}
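The flattening seen in the Solr document can be reproduced with a few lines of JavaScript. This only illustrates the resulting shape (list positions become key suffixes such as `attrs.0`); it is not the connector's actual implementation, and the `flatten` helper is hypothetical.

```javascript
// Flatten nested objects and arrays into dotted keys,
// e.g. attrs[0] -> "attrs.0" and desc[0].val -> "desc.0.val".
function flatten(value, prefix = "", out = {}) {
  if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      flatten(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

const flat = flatten({
  _id: "SPM10823491916",
  desc: [{ lang: "en", val: "Our classic ..." }],
  attrs: ["style=t-shirts", "brand=Drake Waterfowl"],
});
console.log(flat);
```

Because each list element ends up in its own field, a facet over "all attributes" has to enumerate `attrs.0`, `attrs.1`, and so on, which is the difficulty the slide points out.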
51
Let's use Elastic Search…
Search – Using Elastic Search
52
Search - Using Elastic Search
53
Search - Using Elastic Search
Elasticsearch understands the whole document right off the bat
Just need to tell ES not to tokenize the facets:
$ curl -XPOST localhost:9200/largecat3.summary -d '{
    "settings" : { "number_of_shards" : 1 },
    "mappings" : {
      "string" : {
        "properties" : {
          "attrs" : { "type" : "string", "index" : "not_analyzed" }
        }
      }
    }
  }'
("string" is the name of the default mapping type)
> Everything else is indexed auto-magically!
54
Search - Using Elastic Search
Starting up the connector
> Keep it running, it will just stream the Oplog
> mongo-connector
    -m ec2-54-80-63-229.compute-1.amazonaws.com:27017      // the mongo
    -t http://localhost:9200                                // the ES
    -d mongo_connector/doc_managers/elastic_doc_manager.py
    -n "catalog.summary"                                    // target summary collection
    --auto-commit-interval=60                               // commit every 1 min
    …
55
Search - Using Elastic Search
Querying for documents, with facet info… works well
$ curl -X POST "http://localhost:9200/largecat3.summary/_search?pretty=true" -d '
  { "query" : { "query_string" : { "query" : "Ipad" } },
    "facets" : { "tags" : { "terms" : { "field" : "attrs" } } } }'
{
  "took" : 6,
  "hits" : {
    "total" : 151,
    "max_score" : 0.5892989,
    "hits" : [
      {
        "_index" : "largecat3.summary",
        "_type" : "string",
        "_id" : "000000000000000012730000000000QAU-QR2442P",
        "_score" : 0.5892989,
        "_source" : {
          // original JSON from MongoDB
        }
      },
      ...
    ]
  },
  "facets" : {
    "tags" : {
      "_type" : "terms",
      "total" : 1577,
      "terms" : [
        { "term" : "ring size=9", "count" : 120 },
        { "term" : "ring size=8", "count" : 120 },
        { "term" : "metal=sterling silver", "count" : 112 },
        ...
      ]
    }
  }
}
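The facet terms come back as flat "name=value" strings, so a front end would typically regroup them by attribute name before rendering filters. A minimal sketch, with `groupFacetTerms` as a hypothetical helper and the three terms taken from the response above:

```javascript
// Group "name=value" facet terms by attribute name.
function groupFacetTerms(terms) {
  const groups = {};
  for (const { term, count } of terms) {
    const sep = term.indexOf("=");
    if (sep < 0) continue; // not a "name=value" term
    const name = term.slice(0, sep);
    const value = term.slice(sep + 1);
    if (!groups[name]) groups[name] = [];
    groups[name].push({ value, count });
  }
  return groups;
}

// Terms from the "facets.tags.terms" array of the response.
const facetTerms = [
  { term: "ring size=9", count: 120 },
  { term: "ring size=8", count: 120 },
  { term: "metal=sterling silver", count: 112 },
];

const grouped = groupFacetTerms(facetTerms);
console.log(JSON.stringify(grouped, null, 2));
```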
56
How about MongoDB's indexes and Full-Text-Search?
Search – Using MongoDB Indexing
57
The summary contains:
• department e.g. "Shoes"
• Fields to index
– Category path, e.g. "Shoes/Women/Pumps"
– Price
– List of Item Attributes, e.g. Brand = Guess
– List of Variant Attributes, e.g. Color = red
• Fields not to index
– List of Item Secondary Attributes, e.g. Style = Designer
– List of Variant Secondary Attributes, e.g. heel height = 4.0
Search – Using MongoDB indexing
58
• Get summary from item id
db.variation.find( { _id: "p301671" } )
• Get summary's specific variation from SKU
db.variation.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )
• Get summary by department, sorted by rating
db.variation.find( { department: "Shoes" } ).sort( { rating: 1 } )
• Get summary with mix of parameters
db.variation.find( { "department": "Shoes",
    "vars.attrs": { "color": "Gray" },
    "category": /^\/Shoes\/Women\//,
    "price": { "$gte": 65.99, "$lte": 180.99 } } )
Search - Using MongoDB indexing
59
Search – Using MongoDB indexing
• The following indices are used:
– department + attr + category + _id
– department + vars.attrs + category + _id
– department + category + _id
– department + price + _id
– department + rating + _id
• _id used for pagination
• Can take advantage of index intersection
• With several attributes specified (e.g. color=red and size=6), which one is looked up?
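Using `_id` for pagination usually means range-based paging: the last document of one page becomes the lower bound of the next, with `_id` as a tie-breaker so the order is deterministic. The sketch below builds such a filter for a `{ rating: 1, _id: 1 }` sort. `nextPageFilter` and `comesAfter` are hypothetical helpers, and the sample documents are made up.

```javascript
// Build the filter for the page following `last`, assuming the query is
// sorted by { rating: 1, _id: 1 }.
function nextPageFilter(base, last) {
  if (!last) return { ...base }; // first page: no lower bound
  return {
    ...base,
    $or: [
      { rating: { $gt: last.rating } },
      { rating: last.rating, _id: { $gt: last._id } },
    ],
  };
}

// In-memory equivalent of the $or clause, used here only to check the logic.
function comesAfter(doc, last) {
  return doc.rating > last.rating || (doc.rating === last.rating && doc._id > last._id);
}

// Hypothetical documents, already in { rating: 1, _id: 1 } order.
const sorted = [
  { _id: "a", rating: 4 },
  { _id: "b", rating: 4 },
  { _id: "c", rating: 5 },
  { _id: "d", rating: 5 },
];
const lastOfPage1 = sorted[1];
const pageFilter = nextPageFilter({ department: "Shoes" }, lastOfPage1);
const page2 = sorted.filter((d) => comesAfter(d, lastOfPage1)).slice(0, 2);
console.log(JSON.stringify(pageFilter));
console.log(page2.map((d) => d._id));
```

Unlike `skip()`, this keeps each page a bounded range scan on the department + rating + _id index, however deep the user pages.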
60
Facet samples:
{ "_id" : "Accessory Type=Hosiery" , "count" : 14}
{ "_id" : "Ladder Material=Steel" , "count" : 2}
{ "_id" : "Gold Karat=14k" , "count" : 10138}
{ "_id" : "Stone Color=Clear" , "count" : 1648}
{ "_id" : "Metal=White gold" , "count" : 10852}
Single operations to insert / update:
db.facet.update( { _id: "Accessory Type=Hosiery" },
{ $inc: { count: 1 } }, true, false ) // upsert = true, multi = false
The facet with lowest count is the most restrictive…
It should come first in the $all query!
Search – Using MongoDB indexing
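Ordering the `$all` terms by facet count can be done in the application before the query is issued. A minimal sketch follows. `orderBySelectivity` is a hypothetical helper, the counts are three of the facet samples above, and treating unknown terms as count 0 is an assumption.

```javascript
// Facet counts as stored in the `facet` collection (values from this slide).
const facetCounts = new Map([
  ["Gold Karat=14k", 10138],
  ["Stone Color=Clear", 1648],
  ["Metal=White gold", 10852],
]);

// Order the requested terms by ascending count so the most restrictive one
// comes first. Unknown terms are treated as count 0 (assumption).
function orderBySelectivity(terms, counts) {
  return [...terms].sort((a, b) => (counts.get(a) ?? 0) - (counts.get(b) ?? 0));
}

const requested = ["Metal=White gold", "Gold Karat=14k", "Stone Color=Clear"];
const ordered = orderBySelectivity(requested, facetCounts);
const allQuery = { attrs: { $all: ordered } };
console.log(JSON.stringify(allQuery));
```

The `attrs` field name follows the summary model; the input array is copied so the caller's order is left untouched.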
61
• Search Engine advantages:
– Index size (~ 10x smaller than MongoDB's)
– Indexing speed
– Read speed, integrated cache
– All languages support
– Built-in faceted search, which includes facet counts
• MongoDB's Indexing advantages:
– Built into the data store, no additional server / software needed
– Single query to get the results
– Can filter down the variant entry and save computing
> Winner here is Elastic Search
Search – Comparing Solutions
62
Search – Benchmarking
                                                Time
Department  Category  Price  Primary attribute  Average (ms)  90th (ms)  95th (ms)
1           0         0      0                  2             3          3
1           1         0      0                  1             2          2
1           0         1      0                  1             2          3
1           1         1      0                  1             2          2
1           0         0      1                  0             1          2
1           1         0      1                  0             1          1
1           0         1      1                  1             2          2
1           1         1      1                  0             1          1
1           0         0      2                  1             3          3
1           1         0      2                  0             2          2
1           0         1      2                  10            20         35
1           1         1      2                  0             1          1
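The slides do not say how the average and percentile columns were computed. As a point of reference, one common approach (the nearest-rank method) is sketched below. The `percentile` and `average` helpers and the latency samples are hypothetical and are not the benchmark's data.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sortedSamples = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sortedSamples.length) / 100); // 1-based rank
  return sortedSamples[Math.max(rank, 1) - 1];
}

function average(samples) {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Hypothetical query latencies in milliseconds.
const latenciesMs = [1, 1, 2, 1, 3, 2, 1, 1, 2, 35];
console.log(average(latenciesMs), percentile(latenciesMs, 90), percentile(latenciesMs, 95));
```

A single slow query barely moves the 90th percentile but dominates the 95th and the average, which is why the table reports all three.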
Closing Comments
64
Q & A Time
Thank You!
Antoine Girbal, Principal Solutions Engineer, MongoDB Inc., @antoinegirbal