Crawling and Tracking Millions of eCommerce Products at Scale

Date post: 08-Jan-2017
Upload: qiaoliang-xiang
Crawling and Tracking Millions of eCommerce Products at Scale Qiaoliang Xiang Head of Data Science [email protected] , [email protected]
Transcript
Page 1: Crawling and Tracking Millions of eCommerce Products at Scale

Crawling and Tracking Millions of eCommerce Products at Scale

Qiaoliang Xiang, Head of Data Science

[email protected], [email protected]

Page 2: Crawling and Tracking Millions of eCommerce Products at Scale

Outline
•  Introduction
•  Crawl
•  Product
•  Store
•  Conclusion

Page 3: Crawling and Tracking Millions of eCommerce Products at Scale

Introduction – Stores

Page 4: Crawling and Tracking Millions of eCommerce Products at Scale

Introduction – Store

Page 5: Crawling and Tracking Millions of eCommerce Products at Scale

Introduction – Products

Page 6: Crawling and Tracking Millions of eCommerce Products at Scale

Introduction – Problem
•  Problem
    –  How to crawl products?
•  Goals
    –  Flexible
    –  Scalable

Workflow: Crawl -> Process -> Save

Page 7: Crawling and Tracking Millions of eCommerce Products at Scale

Crawl
•  Workflow
    –  Traversal
    –  Fetch
    –  Extract
•  System

Page 8: Crawling and Tracking Millions of eCommerce Products at Scale

Crawl – Workflow
•  Workflow
    –  Traverse a website to get product links
    –  Fetch product data (e.g., HTML)
    –  Extract products

Workflow: Website -> Traverse -> Fetch -> Extract -> Product Store

Page 9: Crawling and Tracking Millions of eCommerce Products at Scale

Workflow – Domain Independent
•  Workflow
    –  Traverse the website and fetch HTML pages
    –  Extract products (domain knowledge)

Workflow: Website -> Traverse & Fetch -> HTML Store -> Extract -> Product Store

Page 10: Crawling and Tracking Millions of eCommerce Products at Scale

Workflow – Domain Independent
•  Traverse & Fetch
    –  Seed URL
    –  Fetch HTML
    –  Extract links
    –  Remove duplicate and visited links

Loop: Seed URL -> HTML -> Extracted Links -> Unvisited Links
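The traverse-and-fetch loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: `traverse`, `extract_links`, `LinkParser`, the `max_pages` limit, and the in-memory `SITE` dictionary (which stands in for real HTTP fetches) are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    """Return absolute links with fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def traverse(seed_url, fetch, max_pages=100):
    """Crawl from seed_url; fetch is any callable mapping a URL to HTML."""
    visited = set()
    frontier = deque([seed_url])
    html_store = {}
    while frontier and len(html_store) < max_pages:
        url = frontier.popleft()
        if url in visited:          # drop duplicates already fetched
            continue
        visited.add(url)
        html = fetch(url)
        html_store[url] = html
        for link in extract_links(url, html):
            if link not in visited:  # keep only unvisited links
                frontier.append(link)
    return html_store


# Hypothetical in-memory website standing in for real HTTP requests.
SITE = {
    "http://shop.example/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://shop.example/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://shop.example/b": '<a href="/a#top">A again</a>',
}
pages = traverse("http://shop.example/", SITE.__getitem__)
```

Passing `fetch` as a parameter keeps the loop independent of the HTTP client, so the same function can be exercised against canned pages.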

Page 11: Crawling and Tracking Millions of eCommerce Products at Scale

Workflow – Domain Dependent
•  Workflow
    –  Traverse product links
        •  Fetch product page
        •  Extract product

Workflow: Website -> Traverse -> Product Links -> Fetch & Extract -> Product Store

Page 12: Crawling and Tracking Millions of eCommerce Products at Scale

Traversal
•  How to traverse a website efficiently and flexibly?
    –  Patterns
    –  Traversal
    –  Components

Workflow: Traverse -> Fetch -> Extract

Page 13: Crawling and Tracking Millions of eCommerce Products at Scale

Traversal – Patterns – Categories
•  Most websites use categories to organize products

Page 14: Crawling and Tracking Millions of eCommerce Products at Scale

Traversal – Patterns – Summaries
•  Product summary pages are listed page by page

[Figure: screenshot with callouts "Product summary" and "Pagination"]

Page 15: Crawling and Tracking Millions of eCommerce Products at Scale

Traversal – Patterns – Structure
•  Categories -> Pages -> Summaries -> Details

[Figure: tree with Website at the root, then Category, Page, Summary, and Detail levels]

Page 16: Crawling and Tracking Millions of eCommerce Products at Scale

Traversal – Depth-first
•  Category -> Page -> Summary -> Detail

[Figure: tree with Website at the root, then Category, Page, Summary, and Detail levels]

Page 17: Crawling and Tracking Millions of eCommerce Products at Scale

Traversal – Breadth-first
•  Categories | Pages | Summaries | Details

[Figure: tree with Website at the root, then Category, Page, Summary, and Detail levels]
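To make the difference between the depth-first and breadth-first slides concrete, here is a small sketch over a made-up site tree. The `TREE` contents and function names are hypothetical; only the two visiting orders come from the slides.

```python
from collections import deque

# Hypothetical site tree: each node maps to its children.
TREE = {
    "website": ["category-1", "category-2"],
    "category-1": ["page-1"],
    "category-2": ["page-2"],
    "page-1": ["summary-1", "summary-2"],
    "page-2": ["summary-3"],
    "summary-1": ["detail-1"],
    "summary-2": ["detail-2"],
    "summary-3": ["detail-3"],
}


def depth_first(root):
    """Follow one branch down to a detail before moving to the next."""
    stack, order = [root], []
    while stack:
        node = stack.pop()
        order.append(node)
        # Reverse so the first child is popped (visited) first.
        stack.extend(reversed(TREE.get(node, [])))
    return order


def breadth_first(root):
    """Visit all categories, then all pages, then summaries, then details."""
    queue, order = deque([root]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(TREE.get(node, []))
    return order


dfs_order = depth_first("website")
bfs_order = breadth_first("website")
```

Depth-first produces the first complete product sooner, while breadth-first finishes each level before starting the next, which matches the "Categories | Pages | Summaries | Details" grouping.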

Page 18: Crawling and Tracking Millions of eCommerce Products at Scale

Traversal – Components
•  Category
•  Page
•  Summary
•  Detail

Input            Crawler             Output
Website          Category crawler    Category links
Category link    Page crawler        Page links
Page link        Summary crawler     Products (summaries)
Product link     Detail crawler      Product (detail)

Component crawler: one-to-many
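One way to read "component crawler: one-to-many" is that every component takes a single input and yields zero or more outputs, so the components can be chained. The sketch below assumes that reading; the lookup tables replace real fetching and parsing, and all names and data are invented for illustration.

```python
# Hypothetical site data; a real crawler would fetch and parse pages instead.
CATEGORIES = {"http://shop.example": ["/phones", "/laptops"]}
PAGES = {
    "/phones": ["/phones?page=1", "/phones?page=2"],
    "/laptops": ["/laptops?page=1"],
}
SUMMARIES = {
    "/phones?page=1": [{"name": "Phone A", "url": "/p/1"}],
    "/phones?page=2": [{"name": "Phone B", "url": "/p/2"}],
    "/laptops?page=1": [{"name": "Laptop C", "url": "/p/3"}],
}
DETAILS = {
    "/p/1": {"price": "888.8"},
    "/p/2": {"price": "1048"},
    "/p/3": {"price": "1111.5"},
}


def category_crawler(website):
    yield from CATEGORIES[website]


def page_crawler(category_link):
    yield from PAGES[category_link]


def summary_crawler(page_link):
    yield from SUMMARIES[page_link]


def detail_crawler(summary):
    # One product summary in, one detailed product out.
    yield {**summary, **DETAILS[summary["url"]]}


def chain(items, crawler):
    """Apply a one-to-many crawler to every item of the previous stage."""
    for item in items:
        yield from crawler(item)


def crawl(website):
    items = [website]
    for crawler in (category_crawler, page_crawler, summary_crawler, detail_crawler):
        items = chain(items, crawler)
    return list(items)


products = crawl("http://shop.example")
```

Because every component has the same shape, a site without categories or without detail pages can be handled by dropping that component from the tuple.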

Page 19: Crawling and Tracking Millions of eCommerce Products at Scale

Crawl – Fetch
•  What to fetch?
    –  API: JSON, CSV, XML, HTML, etc.
    –  HTML
•  requests – HTTP for Humans
    –  Requests: GET and POST
    –  Session and cookies
    –  Streaming downloads

Workflow: Traverse -> Fetch -> Extract

Page 20: Crawling and Tracking Millions of eCommerce Products at Scale

Crawl – Extract
•  How to extract products?
    –  API: JSON, CSV, XML
    –  HTML
        •  JavaScript object
        •  Document Object Model (DOM)
•  Tools
    –  re – regular expression operations
    –  BeautifulSoup – extract data from HTML/XML

Page 21: Crawling and Tracking Millions of eCommerce Products at Scale

Crawl – Extract – Best Practices
•  Best practices
    –  API data
        •  JSON, CSV, XML
    –  JavaScript data
        •  JavaScript object
    –  HTML string
        •  Use regex (e.g., price, link)
    –  HTML DOM
        •  Open Graph tags
        •  CSS: id > class
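The extraction options above can be illustrated on one small page. This sketch uses only the standard library: `html.parser` is swapped in for BeautifulSoup to keep the example dependency-free, and the sample HTML, the `var product = ...` pattern, and the `id="price"` element are all invented. Parsing the JavaScript object with `json.loads` works only when the literal happens to be valid JSON.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical product page.
HTML = """
<html><head>
<meta property="og:title" content="Apple iPhone 6">
<meta property="og:image" content="http://shop.example/img/1.jpg">
<script>var product = {"brand": "Apple", "currency": "SGD"};</script>
</head><body>
<span id="price">SGD 888.80</span>
</body></html>
"""


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.tags[prop[3:]] = attrs.get("content")


def extract_product(html):
    # HTML DOM: Open Graph tags.
    og = OpenGraphParser()
    og.feed(html)
    product = {"name": og.tags.get("title"), "image": og.tags.get("image")}

    # JavaScript data: cut out the object literal and parse it as JSON.
    match = re.search(r"var product = (\{.*?\});", html)
    if match:
        product.update(json.loads(match.group(1)))

    # HTML string: a regex is often enough for a single value such as a price.
    match = re.search(r'id="price">\D*([\d.]+)', html)
    if match:
        product["price"] = match.group(1)
    return product


product = extract_product(HTML)
```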

Page 22: Crawling and Tracking Millions of eCommerce Products at Scale

Crawl – System
•  IDE: PyCharm
•  Python: 3.5
•  Coding style: PEP 8
•  Documentation: sphinx
•  Logging: logging
•  Unit test: pytest
•  Deploy: fabric
•  AWS: boto3

Page 23: Crawling and Tracking Millions of eCommerce Products at Scale

Crawl – System – Scale

[Figure: pipeline from Website to Database with Category, Page, Summary, Detail, and Save stages (numbered 1 to 5) passing category, page, summary, and detail items through a queue]
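The queue-based layout in this figure can be approximated in a single process. In the sketch below, `queue.Queue` and threads stand in for whatever message queue and services the real system used (the slides do not say); the stage functions, the `STOP` sentinel, and `run_pipeline` are assumptions made for the example.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker that no more items will arrive


def worker(crawl, inbox, outbox):
    """Read items from inbox, apply a one-to-many crawler, write to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            break
        for result in crawl(item):
            outbox.put(result)


def run_pipeline(seed, stages):
    """Run one worker thread per stage and collect what reaches the end."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    queues[0].put(seed)
    queues[0].put(STOP)
    saved = []  # stands in for the "Save" step writing to a database
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        saved.append(item)
    for thread in threads:
        thread.join()
    return saved


# Hypothetical stage functions; real ones would fetch and parse pages.
def categories(website):
    return [f"{website}/c{i}" for i in (1, 2)]


def pages(category):
    return [f"{category}?page={i}" for i in (1, 2)]


def summaries(page):
    return [f"{page}#item{i}" for i in (1, 2)]


def details(summary):
    return [{"url": summary}]


saved = run_pipeline("http://shop.example", [categories, pages, summaries, details])
```

Because the stages only communicate through queues, a slow stage can in principle be given more workers without touching the others, which is the scaling argument the slide makes.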

Page 24: Crawling and Tracking Millions of eCommerce Products at Scale

Crawl – Conclusion
•  Crawl
    –  Domain knowledge – efficient
    –  Component – flexible & reusable
    –  Microservice – scale

Components: Website -> Category -> Page -> Summary -> Detail -> Product Store

Page 25: Crawling and Tracking Millions of eCommerce Products at Scale

Product
•  A consistent representation of incompatible products?
    –  Definition
    –  Representation
    –  Schema

Page 26: Crawling and Tracking Millions of eCommerce Products at Scale

Definition – Product & Field
•  What's a product?
    –  A product is a group/bag of fields
•  What's a field?
    –  A field is an aspect of a product
    –  A field has a name and a value
    –  Example: name, URL, images, price, etc.

Page 27: Crawling and Tracking Millions of eCommerce Products at Scale

Definition – Field Guidelines
•  How to define a field?
    –  Name
        •  Unique, informative, meaningful
    –  Value
        •  Follow the best practices (e.g., country, currency, URL)
        •  Be specific to maintain consistency
        •  Be general to deal with different formats

Page 28: Crawling and Tracking Millions of eCommerce Products at Scale

Definition – Field Group
•  What are the fields?
    –  The number of fields is possibly infinite!
    –  Field group
        •  Relevant: relevant to business
        •  Optional: not required but good to have

Field Group    Relevant to business    Available in websites    Examples
Relevant       Yes                     Yes                      Name, Price
Optional       No                      Maybe                    Stock level, Review

Page 29: Crawling and Tracking Millions of eCommerce Products at Scale

Definition – Relevant Fields

Name              Type       Description                 Example
Name              String     Name or title               Apple iPhone 6
URL               URL        Product page link           http://www...
Images            URLs       Image links                 http://www…, …
Currency          String     Currency code               SGD
Price             String     Price or range              888.8-1048
Original price    String     Original price or range     1111.5
Category          Strings    Category path (levels)      Mobiles, Smart Phone
Brand             String     Brand name                  Apple
Description       Strings    List of paragraphs          Retina HD, 3D Touch
Attributes        Map        Key-value pairs             Color: grey, Memory: 16G

Page 30: Crawling and Tracking Millions of eCommerce Products at Scale

Representation – Python
•  How to represent a product in Python?
    –  Rich in data types
        •  None, bool, int, float, str, list, dict, datetime, date, …
    –  Representation
        •  dict
    –  Properties
        •  Flexible
        •  Not universal
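A short sketch of what "flexible" and "not universal" might mean for a dict-based product. The field values reuse the examples from the relevant-fields slide; the URLs, the `crawled_at` field, and the `stock_level` field are invented for illustration.

```python
import json
from datetime import datetime, timezone

product = {
    "name": "Apple iPhone 6",
    "url": "http://shop.example/p/1",              # hypothetical URL
    "images": ["http://shop.example/img/1.jpg"],   # hypothetical URL
    "currency": "SGD",
    "price": "888.8-1048",
    "original_price": "1111.5",
    "category": ["Mobiles", "Smart Phone"],
    "brand": "Apple",
    "description": ["Retina HD", "3D Touch"],
    "attributes": {"Color": "grey", "Memory": "16G"},
    "crawled_at": datetime.now(timezone.utc),      # assumed extra field
}

# Flexible: any field can be added at any time.
product["stock_level"] = 3

# Not universal: Python-only types such as datetime do not serialise as-is.
try:
    json.dumps(product)
    portable = True
except TypeError:
    portable = False
```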

Page 31: Crawling and Tracking Millions of eCommerce Products at Scale

Representation – JSON
•  How to represent a product in JSON?
    –  Fewer but powerful data types
        •  null, true/false, number, string, array, object
    –  Representation
        •  object
    –  Properties
        •  Text
        •  Data-interchange
        •  Widely supported
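As a small illustration of the mapping between Python values and the JSON types listed above, the following sketch round-trips a product through the standard `json` module. The `in_stock` field is made up to show a boolean.

```python
import json

product = {
    "name": "Apple iPhone 6",                           # str  -> string
    "price": "888.8-1048",
    "in_stock": True,                                   # bool -> true/false
    "original_price": None,                             # None -> null
    "category": ["Mobiles", "Smart Phone"],             # list -> array
    "attributes": {"Color": "grey", "Memory": "16G"},   # dict -> object
}

text = json.dumps(product, sort_keys=True)  # plain text, easy to exchange
restored = json.loads(text)                 # back to Python built-ins
```

As long as the dict holds only these primitive types, the round trip is lossless, which is why a "primitive dict" is a convenient step between Python objects and JSON.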

Page 32: Crawling and Tracking Millions of eCommerce Products at Scale

Schema – Introduction
•  Schema
    –  Define the fields of a product
    –  Validate a product (dict)
    –  Convert: dict <-> primitive dict
    –  Serialise/deserialise: primitive dict <-> JSON
•  Tools
    –  JSON Schema: JSON
    –  Schema: Python
    –  Schematics: Python ORM

Definition -> Representation -> Schema
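The four schema responsibilities (define, validate, convert, serialise) can be sketched in plain Python. This is deliberately not the Schematics or JSON Schema API; `PRODUCT_SCHEMA`, `validate`, `to_primitive`, and `to_native` are hypothetical names, and the field list is a shortened, assumed version of the relevant fields.

```python
import json
from datetime import datetime, timezone

# Define: field name -> (expected type, required?)
PRODUCT_SCHEMA = {
    "name": (str, True),
    "url": (str, True),
    "price": (str, True),
    "images": (list, False),
    "attributes": (dict, False),
    "crawled_at": (datetime, False),
}


def validate(product, schema=PRODUCT_SCHEMA):
    """Return a list of error messages; an empty list means valid."""
    errors = []
    for name, (expected, required) in schema.items():
        if name not in product:
            if required:
                errors.append(f"{name}: missing")
        elif not isinstance(product[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    for name in product:
        if name not in schema:
            errors.append(f"{name}: unknown field")
    return errors


def to_primitive(product):
    """Convert: dict -> primitive dict (datetime becomes an ISO string)."""
    return {
        key: value.isoformat() if isinstance(value, datetime) else value
        for key, value in product.items()
    }


def to_native(primitive):
    """Convert: primitive dict -> dict (ISO string becomes datetime)."""
    product = dict(primitive)
    if "crawled_at" in product:
        product["crawled_at"] = datetime.fromisoformat(product["crawled_at"])
    return product


good = {
    "name": "Apple iPhone 6",
    "url": "http://shop.example/p/1",  # hypothetical URL
    "price": "888.8",
    "crawled_at": datetime(2017, 1, 8, tzinfo=timezone.utc),
}
bad = {"name": "Apple iPhone 6", "price": 888.8}

text = json.dumps(to_primitive(good))   # serialise
round_trip = to_native(json.loads(text))  # deserialise and convert back
bad_errors = validate(bad)
```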

Page 33: Crawling and Tracking Millions of eCommerce Products at Scale

Schema – Schematics 1/2

[Code screenshot; callouts: define a schema, create an instance, image data]

Page 34: Crawling and Tracking Millions of eCommerce Products at Scale

Schema – Schematics 2/2

[Code screenshot; callouts: serialize it to JSON, fail to validate, validate the image, convert it to primitive]

Page 35: Crawling and Tracking Millions of eCommerce Products at Scale

Schema – Product 1/2

[Code screenshot; callouts: source, multiple images]

Page 36: Crawling and Tracking Millions of eCommerce Products at Scale

Schema – Product 2/2

[Code screenshot; callouts: string, key-value pairs, a list of strings]

Page 37: Crawling and Tracking Millions of eCommerce Products at Scale

Product – Summary
•  How to manage products?
    –  Schema
        •  Define a product
        •  Validate a product
        •  Convert to and from different representations

[Figure: Website -> (crawl) -> Python -> (convert) -> Python Primitive -> (serialise) -> JSON; labels: website dependent, universal, language dependent, language independent]

Page 38: Crawling and Tracking Millions of eCommerce Products at Scale

Store
•  Database
•  ORM
•  Model
•  Converter

Page 39: Crawling and Tracking Millions of eCommerce Products at Scale

Store – Database
•  Which database to use?
    –  MySQL
    –  PostgreSQL
    –  MongoDB
    –  Apache HBase
    –  Apache Cassandra
•  How to easily switch between different databases?

Page 40: Crawling and Tracking Millions of eCommerce Products at Scale

Store – ORM
•  How to access a database?
    –  Python Database API Specification
    –  Object-Relational Mapping (ORM)
        •  Efficient (productive)
        •  Flexible
    –  ORMs
        •  RDBMS – SQLAlchemy
        •  MongoDB – MongoEngine
        •  Apache Cassandra – DataStax CQL Driver
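For contrast with the ORM route, here is roughly what the lower-level Python Database API option looks like. The sketch uses the standard-library `sqlite3` module as a stand-in for the databases named in the talk; the table layout, the `save` and `load` helpers, and storing optional fields as JSON text in an `extra` column are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "hash TEXT PRIMARY KEY, name TEXT NOT NULL, price TEXT, extra TEXT)"
)


def save(conn, record):
    """Insert or overwrite a record; SQL and column order are hand-written."""
    conn.execute(
        "INSERT OR REPLACE INTO product (hash, name, price, extra) "
        "VALUES (?, ?, ?, ?)",
        (record["hash"], record["name"], record["price"],
         json.dumps(record["extra"])),
    )
    conn.commit()


def load(conn, hash_):
    """Fetch one record by its hash, or None if it does not exist."""
    row = conn.execute(
        "SELECT hash, name, price, extra FROM product WHERE hash = ?",
        (hash_,),
    ).fetchone()
    if row is None:
        return None
    return {"hash": row[0], "name": row[1], "price": row[2],
            "extra": json.loads(row[3])}


record = {"hash": "abc123", "name": "Apple iPhone 6", "price": "888.8",
          "extra": {"stock_level": 3}}
save(conn, record)
loaded = load(conn, "abc123")
```

Every SQL string here is tied to one database dialect and one table layout. An ORM generates that mapping from a model class, which is presumably why the slide rates it as more productive and flexible.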

Page 41: Crawling and Tracking Millions of eCommerce Products at Scale

Store – Model 1/2

[Code screenshot; callouts: UTC time, unique identifier]

Page 42: Crawling and Tracking Millions of eCommerce Products at Scale

Store – Model 2/2

[Code screenshot; callouts: optional fields, JSONB for list/dict, hash]

Page 43: Crawling and Tracking Millions of eCommerce Products at Scale

Store – Converter
•  Convert a product to a record
    –  Compute hash using merchant, country, and URL
    –  Match relevant fields by names
    –  The optional fields go to extra
    –  Created/updated timestamp (system or database)
•  Convert a record to a product
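The converter steps listed above might look like the following sketch. The slides do not say which hash function or key format was used, so SHA-256 over the newline-joined values is an assumption, as are the function names, the record layout, and setting the timestamps on the application side.

```python
import hashlib
from datetime import datetime, timezone

# Relevant fields from the definition slide; everything else is optional.
RELEVANT_FIELDS = (
    "name", "url", "images", "currency", "price", "original_price",
    "category", "brand", "description", "attributes",
)


def compute_hash(merchant, country, url):
    """Derive a stable identifier from merchant, country, and URL."""
    key = "\n".join((merchant, country, url))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def product_to_record(product, merchant, country, now=None):
    now = now or datetime.now(timezone.utc)
    record = {
        "hash": compute_hash(merchant, country, product["url"]),
        "merchant": merchant,
        "country": country,
        "created": now,   # could also be filled in by the database
        "updated": now,
        "extra": {},
    }
    for name, value in product.items():
        if name in RELEVANT_FIELDS:
            record[name] = value           # matched by name
        else:
            record["extra"][name] = value  # optional fields go to extra
    return record


def record_to_product(record):
    product = {name: record[name] for name in RELEVANT_FIELDS if name in record}
    product.update(record["extra"])
    return product


product = {
    "name": "Apple iPhone 6",
    "url": "http://shop.example/p/1",  # hypothetical URL
    "price": "888.8",
    "stock_level": 3,                  # optional field
}
record = product_to_record(product, merchant="shop-example", country="SG")
```

Hashing merchant, country, and URL together gives the same identifier every time the same page is crawled, so a repeated crawl can update the existing record instead of inserting a duplicate.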

Page 44: Crawling and Tracking Millions of eCommerce Products at Scale

Store – Summary
•  How to store products?
    –  Model
        •  Define/validate a product
        •  Convert it to and from product representation
        •  Save it to and load it from a database

[Figure: JSON -> (deserialise) -> Python Primitive -> (convert) -> Python Record -> (save) -> Database; labels: language independent, language dependent, universal, database dependent]

Page 45: Crawling and Tracking Millions of eCommerce Products at Scale

Summary – Workflow
•  Crawl -> Process -> Save

[Figure: workflow diagram with labels Website; Components (Category, Page, Summary, Detail); Schema (Validate, Convert); Product; Model (Validate, Convert); Record; Database]

Page 46: Crawling and Tracking Millions of eCommerce Products at Scale

Summary – Flexible
•  Flexible (migration, split, partition, co-exist)

Website Schema Product Model Record MongoDB

Website Model Record RDBMS

Website Model Record Cassandra

Specific--->General General---->Specific

Page 47: Crawling and Tracking Millions of eCommerce Products at Scale

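Summary – Scalable

[Figure: pipeline from Website to Database with Category, Page, Summary, Detail, and Save stages (numbered 1 to 5) passing category, page, summary, and detail items through a queue; additional labels: schema, model]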

Page 48: Crawling and Tracking Millions of eCommerce Products at Scale

Conclusion
•  Problem-solving
    –  Divide-and-conquer
•  Workflow
    –  Crawl – Component – Scalable
    –  Product – Schema – Flexible
    –  Store – Model – Flexible
•  System
    –  Microservice – flexible & scalable

