Date post: | 08-Jan-2017 |
Category: | Technology |
Upload: | qiaoliang-xiang |
Crawling and Tracking Millions of eCommerce Products at Scale
Qiaoliang Xiang, Head of Data Science
Outline
• Introduction
• Crawl
• Product
• Store
• Conclusion
Introduction – Stores
Introduction – Store
Introduction – Products
Introduction – Problem
• Problem
  – How to crawl products?
• Goals
  – Flexible
  – Scalable
Workflow: Crawl -> Process -> Save
Crawl
• Workflow
  – Traversal
  – Fetch
  – Extract
• System
Crawl – Workflow
• Workflow
  – Traverse a website to get product links
  – Fetch product data (i.e., HTML)
  – Extract products
Website -> Traverse -> Fetch -> Extract -> Product Store
Workflow – Domain Independent
• Workflow
  – Traverse the website and fetch HTML pages
  – Extract products (domain knowledge)
Website -> Traverse & Fetch -> HTML Store -> Extract -> Product Store
Workflow – Domain Independent
• Traverse & Fetch
  – Seed URL
  – Fetch HTML
  – Extract links
  – Remove duplicate and visited links
Seed URL -> HTML -> Extracted Links -> Unvisited Links
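The traverse-and-fetch loop above can be sketched with the standard library alone. This is a minimal illustration, not the speaker's implementation: `traverse`, `extract_links`, and the `fetch` callable (URL in, HTML string out) are hypothetical names, and the `max_pages` cap is an added safety assumption.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links and drop #fragments so duplicates collapse.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def traverse(seed_url, fetch, max_pages=100):
    """Return {url: html} for pages reachable from seed_url.

    `fetch` is a hypothetical callable: url -> HTML string.
    """
    visited = set()
    frontier = deque([seed_url])
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue  # duplicate link queued before its first visit
        visited.add(url)
        html = fetch(url)
        pages[url] = html
        for link in extract_links(url, html):
            if link not in visited:
                frontier.append(link)
    return pages
```

Passing `fetch` as a parameter keeps the traversal domain independent: the same loop works with a real HTTP client or with a dictionary of canned pages in a test.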
Workflow – Domain Dependent
• Workflow
  – Traverse product links
    • Fetch product page
    • Extract product
Website -> Traverse -> Product Links -> Fetch & Extract -> Product Store
Traversal
• How to traverse a website efficiently and flexibly?
  – Patterns
  – Traversal
  – Components
Workflow: Traverse -> Fetch -> Extract
Traversal – Patterns – Categories
• Most websites use categories to organize products
Traversal – Patterns – Summaries
• Product summary pages are listed page by page
[Figure labels: product summary, pagination]
Traversal – Patterns – Structure
• Categories -> Pages -> Summaries -> Details
[Diagram: a tree with Website at the root, then Category, Page, Summary, and Detail nodes at successive levels]
Traversal – Depth-first
• Category -> Page -> Summary -> Detail
[Diagram: a tree of Website, Category, Page, Summary, and Detail nodes]
Traversal – Breadth-first
• Categories | Pages | Summaries | Details
[Diagram: a tree of Website, Category, Page, Summary, and Detail nodes]
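The two traversal orders can be compared on a toy tree. This is a small sketch: the `SITE` dictionary and its node names are made up for illustration, and the Detail level is left out to keep the tree short.

```python
from collections import deque

# Hypothetical site tree: node -> children.
SITE = {
    "website": ["category1", "category2"],
    "category1": ["page1", "page2"],
    "category2": ["page3"],
    "page1": ["summary1", "summary2"],
    "page2": ["summary3"],
    "page3": ["summary4"],
}


def depth_first(root, children):
    """Finish one category (and all its pages) before starting the next."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(children.get(node, [])))
    return order


def breadth_first(root, children):
    """Visit all categories, then all pages, then all summaries."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return order
```

Depth-first yields the first products early, while breadth-first finishes one level of the site before touching the next, which matches the level-by-level bars in the breadth-first slide.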
Traversal – Components
• Category
• Page
• Summary
• Detail
Website -> Category crawler -> Category links
Category link -> Page crawler -> Page links
Page link -> Summary crawler -> Products (summaries)
Product link -> Detail crawler -> Product (detail)
Component crawler: one-to-many
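One way to read "one-to-many" is that each component crawler takes a single link and yields several outputs, so the components chain naturally as generators. The sketch below assumes an in-memory `FAKE_SITE` in place of real pages; every function name and link in it is hypothetical.

```python
# Hypothetical in-memory "website": each link maps to what its crawler finds.
FAKE_SITE = {
    "/": ["/phones", "/laptops"],                      # website -> category links
    "/phones": ["/phones?page=1", "/phones?page=2"],   # category -> page links
    "/laptops": ["/laptops?page=1"],
    "/phones?page=1": ["/p/1", "/p/2"],                # page -> product links
    "/phones?page=2": ["/p/3"],
    "/laptops?page=1": ["/p/4"],
}


def category_crawler(website_link):
    yield from FAKE_SITE[website_link]


def page_crawler(category_link):
    yield from FAKE_SITE[category_link]


def summary_crawler(page_link):
    for product_link in FAKE_SITE[page_link]:
        yield {"url": product_link}  # product summary


def detail_crawler(summary):
    # A real detail crawler would fetch the product page; here we fake a name.
    return dict(summary, name="Product " + summary["url"].rsplit("/", 1)[-1])


def chain(links, crawler):
    """Feed every input into a one-to-many crawler and flatten the results."""
    for link in links:
        yield from crawler(link)


def crawl(website_link):
    categories = category_crawler(website_link)
    pages = chain(categories, page_crawler)
    summaries = chain(pages, summary_crawler)
    return [detail_crawler(summary) for summary in summaries]
```

Because each component only depends on its input link, a single component can be swapped or reused for another website without touching the others.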
Crawl – Fetch
• What to fetch?
  – API: JSON, CSV, XML, HTML, etc.
  – HTML
• requests – HTTP for Humans
  – Requests: GET and POST
  – Session and cookies
  – Streaming downloads
Workflow: Traverse -> Fetch -> Extract
Crawl – Extract
• How to extract products?
  – API: JSON, CSV, XML
  – HTML
    • JavaScript object
    • Document Object Model (DOM)
• Tools
  – re – regular expression operations
  – Beautiful Soup – extract data from HTML/XML
Crawl – Extract – Best Practices
• Best Practices
  – API data
    • JSON, CSV, XML
  – JavaScript data
    • JavaScript object
  – HTML string
    • Use regex (e.g., price, link)
  – HTML DOM
    • Open Graph tags
    • CSS: id > class
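Two of the practices above, a regex on the raw HTML string and Open Graph tags from the DOM, can be sketched as follows. The slides name Beautiful Soup for DOM work; this example uses the standard library `html.parser` instead so that it runs without extra packages. The price pattern (a `$`, `S$`, or `SGD` prefix) and the function names are assumptions for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed price format: "$", "S$" or "SGD" followed by a number.
PRICE_RE = re.compile(r"(?:S?\$|SGD)\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.tags[prop[3:]] = attrs["content"]


def extract_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


def extract_price(html):
    """Return the first price found as a string, or None."""
    match = PRICE_RE.search(html)
    return match.group(1).replace(",", "") if match else None
```

Open Graph tags tend to be more stable than page layout, which is one reason to try them before CSS selectors.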
Crawl – System
• IDE: PyCharm
• Python: 3.5
• Coding style: PEP 8
• Documentation: Sphinx
• Logging: logging
• Unit test: pytest
• Deploy: Fabric
• AWS: boto3
Crawl – System – Scale
[Diagram: Website -> Category -> Page -> Summary -> Detail -> Save -> Database; the stages exchange category, page, summary, and detail messages through a queue, with steps numbered 1 to 5]
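The idea of stages that talk only through queues can be sketched in a single process. The slides do not say which queue technology is used, so the standard library `queue.Queue` and one thread per stage stand in for it here; `stage`, `run_pipeline`, and the `STOP` sentinel are hypothetical names.

```python
import queue
import threading

STOP = object()  # sentinel telling a stage to shut down


def stage(worker, inbox, outbox):
    """Run `worker` (one item -> iterable of items) until STOP arrives."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            return
        for result in worker(item):
            outbox.put(result)


def run_pipeline(seed, workers):
    """Connect the workers with queues and collect the last stage's output."""
    queues = [queue.Queue() for _ in range(len(workers) + 1)]
    threads = [
        threading.Thread(target=stage, args=(worker, queues[i], queues[i + 1]))
        for i, worker in enumerate(workers)
    ]
    for thread in threads:
        thread.start()
    queues[0].put(seed)
    queues[0].put(STOP)
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

Since a stage sees only its inbox and outbox, a slow stage could be scaled by running more workers on the same queue; that would need one STOP per worker, which this sketch leaves out.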
Crawl – Conclusion
• Crawl
  – Domain knowledge – Efficient
  – Component – Flexible & reusable
  – Microservice – Scale
Components: Website -> Category -> Page -> Summary -> Detail -> Product Store
Product
• A consistent representation of incompatible products?
  – Definition
  – Representation
  – Schema
Definition – Product & Field
• What's a product?
  – A product is a group/bag of fields
• What's a field?
  – A field is an aspect of a product
  – A field has a name and a value
  – Example: name, URL, images, price, etc.
Definition – Field Guidelines
• How to define a field?
  – Name
    • Unique, informative, meaningful
  – Value
    • Follow best practices (e.g., country, currency, URL)
    • Be specific to maintain consistency
    • Be general to deal with different formats
Definition – Field Group
• What are the fields?
  – The number of fields is possibly infinite!
  – Field Group
    • Relevant: relevant to business
    • Optional: not required but good to have

Field Group | Relevant to business | Available in websites | Examples
Relevant    | Yes                  | Yes                   | Name, Price
Optional    | No                   | Maybe                 | Stock level, Review
Definition – Relevant Fields

Name           | Type    | Description             | Example
Name           | String  | Name or title           | Apple iPhone 6
URL            | URL     | Product page link       | http://www...
Images         | URLs    | Image links             | http://www…,…
Currency       | String  | Currency code           | SGD
Price          | String  | Price or range          | 888.8-1048
Original price | String  | Original price or range | 1111.5
Category       | Strings | Category path (levels)  | Mobiles, Smart Phone
Brand          | String  | Brand name              | Apple
Description    | Strings | List of paragraphs      | Retina HD, 3D Touch
Attributes     | Map     | Key-value pairs         | Color: grey, Memory: 16G
Representation – Python
• How to represent a product in Python?
  – Rich in data types
    • None, bool, int, float, str, list, dict, datetime, date, …
  – Representation
    • dict
  – Properties
    • Flexible
    • Not universal
Representation – JSON
• How to represent a product in JSON?
  – Fewer but powerful data types
    • null, true/false, number, string, array, object
  – Representation
    • object
  – Properties
    • Text
    • Data-interchange
    • Widely supported
Schema – Introduction
• Schema
  – Define the fields of a product
  – Validate a product (dict)
  – Convert: dict <-> primitive dict
  – Serialise/deserialise: primitive dict <-> JSON
• Tools
  – JSON Schema: JSON
  – Schema: Python
  – Schematics: Python ORM
Definition, Representation, Schema
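The four schema duties listed above (define, validate, convert, serialise) can be illustrated with the standard library. This is deliberately not the Schematics API used in the talk; `PRODUCT_SCHEMA`, `validate`, `to_primitive`, `to_native`, and the `crawled_at` field are hypothetical, and only a few of the relevant fields are included.

```python
import json
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed text form for datetime values

# Hypothetical schema: field name -> (expected type, required?)
PRODUCT_SCHEMA = {
    "name": (str, True),
    "url": (str, True),
    "images": (list, False),
    "price": (str, True),
    "crawled_at": (datetime, False),
}


def validate(product, schema=PRODUCT_SCHEMA):
    """Return a list of error strings; an empty list means valid."""
    errors = []
    for field, (expected, required) in schema.items():
        if field not in product:
            if required:
                errors.append("missing field: " + field)
        elif not isinstance(product[field], expected):
            errors.append("wrong type for field: " + field)
    for field in product:
        if field not in schema:
            errors.append("unknown field: " + field)
    return errors


def to_primitive(product):
    """dict -> primitive dict holding only JSON-compatible values."""
    return {
        key: value.strftime(TIME_FORMAT) if isinstance(value, datetime) else value
        for key, value in product.items()
    }


def to_native(primitive, schema=PRODUCT_SCHEMA):
    """primitive dict -> dict, restoring datetime fields."""
    return {
        key: datetime.strptime(value, TIME_FORMAT)
        if schema[key][0] is datetime else value
        for key, value in primitive.items()
    }


def serialise(product):
    return json.dumps(to_primitive(product), sort_keys=True)


def deserialise(text):
    return to_native(json.loads(text))
```

The primitive dict is the bridge between the two worlds: it is still a Python object, but every value in it already has a direct JSON counterpart.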
Schema – Schematics 1/2
[Code slide; surviving callouts: define a schema, create an instance, image data]
Schema – Schematics 2/2
[Code slide; surviving callouts: serialize it to JSON, fail to validate, validate the image, convert it to primitive]
Schema – Product 1/2
[Code slide; surviving callouts: source, multiple images]
Schema – Product 2/2
[Code slide; surviving callouts: string, key-value pairs, a list of strings]
Product – Summary
• How to manage products?
  – Schema
    • Define a product
    • Validate a product
    • Convert to and from different representations
[Diagram: Website -(crawl)-> Python -(convert)-> Python Primitive -(serialise)-> JSON; labels: website dependent, language dependent, language independent, universal]
Store
• Database
• ORM
• Model
• Converter
Store – Database
• Which database to use?
  – MySQL
  – PostgreSQL
  – MongoDB
  – Apache HBase
  – Apache Cassandra
• How to easily switch between different databases?
Store – ORM
• How to access a database?
  – Python Database API Specification
  – Object-Relational Mapping (ORM)
    • Efficient (productive)
    • Flexible
  – ORMs
    • RDBMS – SQLAlchemy
    • MongoDB – MongoEngine
    • Apache Cassandra – DataStax CQL Driver
Store – Model 1/2
[Code slide; surviving callouts: UTC time, unique identifier]
Store – Model 2/2
[Code slide; surviving callouts: optional fields, JSONB for list/dict, hash]
Store – Converter
• Convert a product to a record
  – Compute hash using merchant, country, and URL
  – Match relevant fields by names
  – The optional fields go to extra
  – Created/updated timestamp (system or database)
• Convert a record to a product
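The converter steps above can be sketched with plain dicts standing in for database records. The slides do not say which hash function is used, so SHA-1 over a `|`-joined key is an assumption here, as are the function names and the snake_case field names.

```python
import hashlib
from datetime import datetime, timezone

# Relevant fields from the talk, written as assumed snake_case column names.
RELEVANT_FIELDS = (
    "name", "url", "images", "currency", "price", "original_price",
    "category", "brand", "description", "attributes",
)


def product_hash(merchant, country, url):
    """Stable identifier computed from merchant, country, and URL."""
    key = "|".join((merchant, country, url))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def product_to_record(product, merchant, country, now=None):
    now = now or datetime.now(timezone.utc)  # system timestamp, in UTC
    record = {
        "hash": product_hash(merchant, country, product["url"]),
        "merchant": merchant,
        "country": country,
        "created": now,
        "updated": now,
        "extra": {},
    }
    for field, value in product.items():
        if field in RELEVANT_FIELDS:
            record[field] = value           # matched by name
        else:
            record["extra"][field] = value  # optional fields go to extra
    return record


def record_to_product(record):
    product = {f: record[f] for f in RELEVANT_FIELDS if f in record}
    product.update(record.get("extra", {}))
    return product
```

Keeping optional fields in one `extra` column means a new optional field from a website does not require a schema migration in the database.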
Store – Summary
• How to store products?
  – Model
    • Define/validate a product
    • Convert it to and from the product representation
    • Save it to and load it from a database
[Diagram: JSON -(deserialise)-> Python Primitive -(convert)-> Python Record -(save)-> Database; labels: language independent, language dependent, universal, database dependent]
Summary – Workflow
• Crawl -> Process -> Save
[Diagram labels: Website; Components (Category, Page, Summary, Detail); Schema (Validate, Convert); Model (Validate, Convert); Product; Record; Database]
Summary – Flexible
• Flexible (migration, split, partition, co-exist)
[Diagram rows: Website, Schema, Product, Model, Record, MongoDB; Website, Model, Record, RDBMS; Website, Model, Record, Cassandra. Caption: Specific -> General, General -> Specific]
Summary – Scalable
[Diagram labels: Website, Category, Page, Summary, Detail, Save, Database; queue messages category, page, summary, detail; steps 1 to 5; annotations schema and model]
Conclusion
• Problem-solving
  – Divide-and-conquer
• Workflow
  – Crawl – Component – Scalable
  – Product – Schema – Flexible
  – Store – Model – Flexible
• System
  – Microservice – flexible & scalable