Crawling and Tracking Millions of eCommerce Products at Scale

Post on 08-Jan-2017

Qiaoliang Xiang, Head of Data Science

qiaoliang@shopback.com, qiaoliangxiang@gmail.com

Outline
•  Introduction
•  Crawl
•  Product
•  Store
•  Conclusion

Introduction – Stores

Introduction – Store

Introduction – Products

Introduction – Problem
•  Problem
   –  How to crawl products?
•  Goals
   –  Flexible
   –  Scalable

[Workflow: Crawl -> Process -> Save]

Crawl
•  Workflow
   –  Traversal
   –  Fetch
   –  Extract
•  System

Crawl – Workflow
•  Workflow
   –  Traverse a website to get product links
   –  Fetch product data (i.e., HTML)
   –  Extract products

[Workflow: Website -> Traverse -> Fetch -> Extract -> Product Store]

Workflow – Domain Independent
•  Workflow
   –  Traverse the website and fetch HTML pages
   –  Extract products (domain knowledge)

[Workflow: Website -> Traverse & Fetch -> HTML Store -> Extract -> Product Store]

Workflow – Domain Independent
•  Traverse & Fetch
   –  Seed URL
   –  Fetch HTML
   –  Extract links
   –  Remove duplicate and visited links

[Workflow: Seed URL -> HTML -> Extracted Links -> Unvisited Links]
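The seed URL, fetch, extract links, and de-duplicate loop could look roughly like the sketch below. It is only an illustration: `traverse`, `extract_links`, and the `fetch_html` callback are hypothetical names, and the standard-library `html.parser` stands in for whatever link extractor the real crawler uses.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    """Return absolute links found in `html`, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def traverse(seed_url, fetch_html, max_pages=100):
    """Yield (url, html) pairs, visiting each URL at most once.

    `fetch_html` is a caller-supplied function url -> HTML string
    (an assumption of this sketch, so the loop stays network-free).
    """
    seen = {seed_url}            # duplicate and visited links are skipped
    frontier = deque([seed_url])  # unvisited links
    while frontier and max_pages > 0:
        url = frontier.popleft()
        html = fetch_html(url)
        max_pages -= 1
        yield url, html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
```

Keeping the fetch function as a parameter means the same loop can be exercised against an in-memory dictionary of pages before it is pointed at a real site.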

Workflow – Domain Dependent
•  Workflow
   –  Traverse product links
      •  Fetch product page
      •  Extract product

[Workflow: Website -> Traverse -> Product Links -> Fetch & Extract -> Product Store]

Traversal
•  How to traverse a website efficiently and flexibly?
   –  Patterns
   –  Traversal
   –  Components

[Workflow: Traverse -> Fetch -> Extract]

Traversal – Patterns – Categories
•  Most websites use categories to organize products

Traversal – Patterns – Summaries
•  Product summary pages are listed page by page

[Figure labels: Product summary, Pagination]

Traversal – Patterns – Structure
•  Categories -> Pages -> Summaries -> Details

[Tree diagram: Website at the root, then Category, Page, Summary, and Detail levels]

Traversal – Depth-first
•  Category -> Page -> Summary -> Detail

[Tree diagram: Website at the root, then Category, Page, Summary, and Detail levels]

Traversal – Breadth-first
•  Categories | Pages | Summaries | Details

[Tree diagram: Website at the root, then Category, Page, Summary, and Detail levels]
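The difference between the two traversal orders can be seen on a toy site tree. This is a minimal sketch: the `TREE` dictionary and its node names are made up for illustration and do not come from any real website.

```python
from collections import deque

# Hypothetical site: website -> categories -> pages -> summaries -> details
TREE = {
    "website": ["cat1", "cat2"],
    "cat1": ["cat1/p1"],
    "cat2": ["cat2/p1"],
    "cat1/p1": ["s1"],
    "cat2/p1": ["s2"],
    "s1": ["d1"],
    "s2": ["d2"],
}


def depth_first(root, children):
    """Follow one branch down to its details before starting the next."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the first child is popped first.
        stack.extend(reversed(children.get(node, [])))


def breadth_first(root, children):
    """Finish a whole level (all categories, then all pages, ...) first."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(children.get(node, []))
```

Depth-first reaches the first product detail quickly, while breadth-first produces each level as a complete batch, which fits naturally with one queue per level.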

Traversal – Components
•  Category
•  Page
•  Summary
•  Detail

Website       -> Category crawler -> Category links
Category link -> Page crawler     -> Page links
Page link     -> Summary crawler  -> Products (summaries)
Product link  -> Detail crawler   -> Product (detail)

Component crawler: one-to-many
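One way to read "one-to-many" is that each component crawler takes a single input and yields zero or more outputs, so the components can be chained. The sketch below assumes that reading; the four crawler functions return fabricated links and a placeholder product, and all names are hypothetical.

```python
def category_crawler(website):
    """Website -> category links (fabricated for the example)."""
    yield f"{website}/phones"
    yield f"{website}/laptops"


def page_crawler(category_link):
    """Category link -> page links."""
    for page in (1, 2):
        yield f"{category_link}?page={page}"


def summary_crawler(page_link):
    """Page link -> product summaries (each carries a product link)."""
    for item in (1, 2):
        yield {"url": f"{page_link}&item={item}"}


def detail_crawler(summary):
    """Product summary -> product detail (placeholder name field)."""
    yield {**summary, "name": "placeholder"}


def flat_map(crawler, inputs):
    """Apply a one-to-many crawler to every input and flatten the results."""
    for item in inputs:
        yield from crawler(item)


def run_pipeline(website, crawlers):
    """Feed the website through each component crawler in turn."""
    items = [website]
    for crawler in crawlers:
        items = flat_map(crawler, items)
    return list(items)
```

Because every component has the same shape, a site without categories or without pagination can simply drop the corresponding crawler from the list.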

Crawl – Fetch
•  What to fetch?
   –  API: JSON, CSV, XML, HTML, etc.
   –  HTML
•  requests – HTTP for Humans
   –  Requests: GET and POST
   –  Session and cookies
   –  Streaming downloads

[Workflow: Traverse -> Fetch -> Extract]
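A streaming download with a session might be sketched as follows. The function name `fetch_to_file` and its parameters are hypothetical; the `session` argument is assumed to behave like a `requests.Session`, and the code relies only on `Session.get(..., stream=True)`, `Response.raise_for_status()`, and `Response.iter_content()`.

```python
def fetch_to_file(session, url, path, chunk_size=64 * 1024, timeout=10):
    """Stream the response body for `url` into `path`; return bytes written.

    `session` is expected to be a requests.Session (or a compatible object),
    so cookies set by earlier requests are sent automatically.
    """
    written = 0
    with session.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # fail early on 4xx/5xx responses
        with open(path, "wb") as out:
            # iter_content yields the body in chunks instead of loading it
            # into memory at once.
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
                written += len(chunk)
    return written
```

With requests installed, a call would look like `fetch_to_file(requests.Session(), url, "page.html")`. Passing the session in also makes it easy to substitute a fake session in unit tests.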

Crawl – Extract
•  How to extract products?
   –  API: JSON, CSV, XML
   –  HTML
      •  JavaScript object
      •  Document Object Model (DOM)
•  Tools
   –  re – regular expression operations
   –  Beautiful Soup – extract data from HTML/XML

Crawl – Extract – Best Practices
•  Best practices
   –  API data
      •  JSON, CSV, XML
   –  JavaScript data
      •  JavaScript object
   –  HTML string
      •  Use regex (e.g., price, link)
   –  HTML DOM
      •  Open Graph tags
      •  CSS selectors: id > class (prefer id over class)
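Three of these extraction routes are illustrated below on a made-up product page. This is a sketch under assumptions: the sample HTML, the `product` variable name, and the `price` class are invented, the embedded JavaScript object is assumed to be valid JSON, and the standard-library `html.parser` is used in place of Beautiful Soup to keep the example dependency-free.

```python
import json
import re
from html.parser import HTMLParser

# Fabricated page used only for this example.
HTML = """
<html><head>
<meta property="og:title" content="Apple iPhone 6">
<meta property="og:url" content="http://shop.test/p/1">
<script>var product = {"brand": "Apple", "price": "888.8"};</script>
</head><body><span class="price">888.8</span></body></html>
"""


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.tags[prop[3:]] = attrs["content"]


def extract_open_graph(html):
    """HTML DOM route: read Open Graph tags."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


def extract_js_object(html, variable):
    """JavaScript route: parse `var <variable> = {...};` as JSON.

    The non-greedy match stops at the first "};", so nested objects
    would need a more careful parser.
    """
    pattern = r"var\s+" + re.escape(variable) + r"\s*=\s*(\{.*?\});"
    match = re.search(pattern, html, re.DOTALL)
    return json.loads(match.group(1)) if match else None


def extract_price(html):
    """HTML string route: pull the price with a regular expression."""
    match = re.search(r'class="price"[^>]*>\s*([\d.]+)', html)
    return match.group(1) if match else None
```

The price is kept as a string, in line with the field definitions later in the deck, where a price may also be a range.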

Crawl – System
•  IDE: PyCharm
•  Python: 3.5
•  Coding style: PEP 8
•  Documentation: Sphinx
•  Logging: logging
•  Unit test: pytest
•  Deploy: Fabric
•  AWS: boto3

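Crawl – System – Scale

[Diagram: the Website is crawled by the Category, Page, Summary, and Detail crawlers, which pass work through the category, page, summary, and detail queues; a Save step writes to the Database. Steps are numbered 1 to 5.]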
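The queue-connected layout can be imitated in a single process with threads and `queue.Queue`. This is only a sketch of the idea, not the deck's system: a real deployment would presumably use an external queue service and separate worker processes, and the names `run_stage`, `run_pipeline`, and the two toy crawlers are hypothetical.

```python
import queue
import threading

STOP = object()  # marker telling a worker that no more input will arrive


def run_stage(crawler, in_q, out_q, workers):
    """Start `workers` threads applying a one-to-many crawler to queued items."""

    def work():
        while True:
            item = in_q.get()
            if item is STOP:
                break
            for result in crawler(item):
                out_q.put(result)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    return threads


def run_pipeline(seed, crawlers, workers=2):
    """Chain the crawlers with queues and return everything the last one emits."""
    queues = [queue.Queue() for _ in range(len(crawlers) + 1)]
    stages = [
        run_stage(crawler, queues[i], queues[i + 1], workers)
        for i, crawler in enumerate(crawlers)
    ]
    queues[0].put(seed)
    for i, threads in enumerate(stages):
        # The previous stage has finished, so this stage's input is complete:
        # send one STOP per worker, then wait for the workers to drain it.
        for _ in threads:
            queues[i].put(STOP)
        for thread in threads:
            thread.join()
    results = []
    while not queues[-1].empty():
        results.append(queues[-1].get())
    return results


def category_crawler(website):
    """Toy crawler: website -> category links."""
    return [f"{website}/c{i}" for i in (1, 2)]


def page_crawler(category_link):
    """Toy crawler: category link -> page links."""
    return [f"{category_link}?page={i}" for i in (1, 2, 3)]
```

Each stage scales independently by changing its worker count, and the result order is not deterministic because workers run concurrently.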

Crawl – Conclusion
•  Crawl
   –  Domain knowledge – efficient
   –  Component – flexible & reusable
   –  Microservice – scale

[Components: Website -> Category -> Page -> Summary -> Detail -> Product Store]

Product
•  A consistent representation of incompatible products?
   –  Definition
   –  Representation
   –  Schema

Definition – Product & Field
•  What's a product?
   –  A product is a group/bag of fields
•  What's a field?
   –  A field is an aspect of a product
   –  A field has a name and a value
   –  Examples: name, URL, images, price, etc.

Definition – Field Guidelines
•  How to define a field?
   –  Name
      •  Unique, informative, meaningful
   –  Value
      •  Follow the best practices (e.g., country, currency, URL)
      •  Be specific to maintain consistency
      •  Be general to deal with different formats

Definition – Field Group
•  What are the fields?
   –  The number of fields is possibly infinite!
   –  Field group
      •  Relevant: relevant to business
      •  Optional: not required but good to have

Field group   Relevant to business   Available in websites   Examples
Relevant      Yes                    Yes                     Name, Price
Optional      No                     Maybe                   Stock level, Review

Definition – Relevant Fields

Name             Type      Description               Example
Name             String    Name or title             Apple iPhone 6
URL              URL       Product page link         http://www...
Images           URLs      Image links               http://www…, …
Currency         String    Currency code             SGD
Price            String    Price or range            888.8-1048
Original price   String    Original price or range   1111.5
Category         Strings   Category path (levels)    Mobiles, Smart Phone
Brand            String    Brand name                Apple
Description      Strings   List of paragraphs        Retina HD, 3D Touch
Attributes       Map       Key-value pairs           Color: grey, Memory: 16G

Representation – Python
•  How to represent a product in Python?
   –  Rich in data types
      •  None, bool, int, float, str, list, dict, datetime, date, …
   –  Representation
      •  dict
   –  Properties
      •  Flexible
      •  Not universal

Representation – JSON
•  How to represent a product in JSON?
   –  Fewer but powerful data types
      •  null, true/false, number, string, array, object
   –  Representation
      •  object
   –  Properties
      •  Text
      •  Data-interchange
      •  Widely supported
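The gap between the two representations shows up as soon as a Python dict holds a type JSON lacks. The sketch below reuses the example values from the field table; the URL and the `crawled_at` field are hypothetical additions for illustration, and `to_primitive` is an illustrative helper, not a library function.

```python
import json
from datetime import date, datetime, timezone

# A product as a Python dict. "crawled_at" is a made-up field that holds a
# datetime, a type with no JSON equivalent.
product = {
    "name": "Apple iPhone 6",
    "url": "http://shop.test/p/1",
    "currency": "SGD",
    "price": "888.8-1048",
    "category": ["Mobiles", "Smart Phone"],
    "attributes": {"Color": "grey", "Memory": "16G"},
    "crawled_at": datetime(2017, 1, 8, tzinfo=timezone.utc),
}


def to_primitive(value):
    """Recursively replace Python-only types with JSON-compatible ones."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: to_primitive(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_primitive(item) for item in value]
    return value


# json.dumps(product) would raise TypeError because of the datetime value;
# the primitive dict serialises cleanly.
as_json = json.dumps(to_primitive(product))
```

This is the sense in which a dict is flexible but not universal: it can hold anything, but only its primitive form crosses language boundaries.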

Schema – Introduction
•  Schema
   –  Define the fields of a product
   –  Validate a product (dict)
   –  Convert: dict <-> primitive dict
   –  Serialise/deserialise: primitive dict <-> JSON
•  Tools
   –  JSON Schema: JSON
   –  Schema: Python
   –  Schematics: Python ORM

[Progress: Definition -> Representation -> Schema]

Schema – Schematics 1/2

[Code example; annotations: define a schema, create an instance, image data]

Schema – Schematics 2/2

[Code example; annotations: serialize it to JSON, fail to validate, validate the image, convert it to primitive]
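The steps named in these annotations (define a schema, create an instance, validate, convert to primitive, serialise to JSON) can be walked through with a plain-Python stand-in. This is deliberately not the Schematics API: `ImageSchema`, its fields, and `ValidationError` are hypothetical names written for this sketch.

```python
import json
from urllib.parse import urlparse


class ValidationError(Exception):
    """Raised when data does not match the schema."""


class ImageSchema:
    """Hypothetical schema: an image has a required URL and an optional size."""

    fields = {"url": str, "width": int, "height": int}
    required = {"url"}

    def __init__(self, data):
        self.data = dict(data)

    def validate(self):
        missing = self.required - self.data.keys()
        if missing:
            raise ValidationError(f"missing fields: {sorted(missing)}")
        for name, value in self.data.items():
            expected = self.fields.get(name)
            if expected is None:
                raise ValidationError(f"unknown field: {name}")
            if not isinstance(value, expected):
                raise ValidationError(f"{name} must be {expected.__name__}")
        if urlparse(self.data["url"]).scheme not in ("http", "https"):
            raise ValidationError("url must be an http(s) link")

    def to_primitive(self):
        """Return a dict made only of JSON-compatible values."""
        return dict(self.data)

    def to_json(self):
        return json.dumps(self.to_primitive(), sort_keys=True)


# Create an instance from image data, validate it, then serialise it.
image = ImageSchema({"url": "http://shop.test/i/1.jpg", "width": 800, "height": 600})
image.validate()
image_json = image.to_json()
```

A schema library adds typed field classes and nested models on top of the same define, validate, convert, and serialise cycle.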

Schema – Product 1/2

[Code example; annotations: source, multiple images]

Schema – Product 2/2

[Code example; annotations: string, key-value pairs, a list of string]

Product – Summary
•  How to manage products?
   –  Schema
      •  Define a product
      •  Validate a product
      •  Convert to and from different representations

[Diagram: Website -> (crawl) -> Python -> (convert) -> Primitive Python -> (serialise) -> JSON. Labels: website dependent, universal, language dependent, language independent.]

Store
•  Database
•  ORM
•  Model
•  Converter

Store – Database
•  Which database to use?
   –  MySQL
   –  PostgreSQL
   –  MongoDB
   –  Apache HBase
   –  Apache Cassandra
•  How to easily switch between different databases?

Store – ORM
•  How to access a database?
   –  Python Database API Specification
   –  Object-Relational Mapping (ORM)
      •  Efficient (productive)
      •  Flexible
   –  ORMs
      •  RDBMS – SQLAlchemy
      •  MongoDB – MongoEngine
      •  Apache Cassandra – DataStax CQL Driver

Store – Model 1/2

[Code example; annotations: UTC time, unique identifier]

Store – Model 2/2

[Code example; annotations: optional fields, JSONB for list/dict, hash]

Store – Converter
•  Convert a product to a record
   –  Compute hash using merchant, country, and URL
   –  Match relevant fields by names
   –  The optional fields go to extra
   –  Created/updated timestamp (system or database)
•  Convert a record to a product
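The converter rules listed above might be sketched as below. Several details are assumptions of this example rather than facts from the slides: SHA-256 as the hash function, "|" as the separator, the lower-case field names, and the function names `product_to_record` and `record_to_product`.

```python
import hashlib
from datetime import datetime, timezone

# Relevant field names, following the field table (lower-cased here).
RELEVANT_FIELDS = (
    "name", "url", "images", "currency", "price", "original_price",
    "category", "brand", "description", "attributes",
)


def product_to_record(product, merchant, country, now=None):
    """Convert a product dict into a flat record ready to be saved."""
    now = now or datetime.now(timezone.utc)  # system timestamp, in UTC
    key = "|".join((merchant, country, product["url"]))
    record = {
        # The same merchant, country, and URL always give the same hash.
        "hash": hashlib.sha256(key.encode("utf-8")).hexdigest(),
        "merchant": merchant,
        "country": country,
        "created": now,
        "updated": now,
        "extra": {},
    }
    for name, value in product.items():
        if name in RELEVANT_FIELDS:
            record[name] = value           # relevant fields match by name
        else:
            record["extra"][name] = value  # optional fields go to extra
    return record


def record_to_product(record):
    """Rebuild the product dict from a record."""
    product = {name: record[name] for name in RELEVANT_FIELDS if name in record}
    product.update(record.get("extra", {}))
    return product
```

Deriving the hash from merchant, country, and URL gives a stable identifier, so re-crawling the same page can update the existing record instead of inserting a duplicate.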

Store – Summary
•  How to store products?
   –  Model
      •  Define/validate a product
      •  Convert it to and from the product representation
      •  Save it to and load it from a database

[Diagram: JSON -> (deserialise) -> Primitive Python -> (convert) -> Python Record -> (save) -> Database. Labels: language independent, language dependent, universal, database dependent.]

Summary – Workflow
•  Crawl -> Process -> Save

[Diagram: Website -> Components (Category, Page, Summary, Detail) -> Schema (Validate, Convert) -> Product -> Model (Validate, Convert) -> Record -> Database]

Summary – Flexible
•  Flexible (migration, split, partition, co-exist)

[Diagram rows:
   Website -> Schema -> Product -> Model -> Record -> MongoDB
   Website, Model -> Record -> RDBMS
   Website, Model -> Record -> Cassandra
 Labels: Specific ---> General (website side), General ---> Specific (database side)]

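Summary – Scalable

[Diagram: the Website is crawled by the Category, Page, Summary, and Detail crawlers, which pass work through the category, page, summary, and detail queues; a Save step writes to the Database. Steps are numbered 1 to 5. Additional labels: queue, schema, model.]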

Conclusion
•  Problem-solving
   –  Divide-and-conquer
•  Workflow
   –  Crawl – Component – Scalable
   –  Product – Schema – Flexible
   –  Store – Model – Flexible
•  System
   –  Microservice – flexible & scalable