Date posted: 17-Jan-2017
Category: Data & Analytics
Uploaded by: vincent-michel
Data Science in the E-commerce Industry
DSSP, 2016/05/20
Vincent Michel
Big Data Europe, BDD, Rakuten Inc. / PriceMinister
@HowIMetYourData
Short Bio
Education
• ESPCI: engineer in Physics / Biology
• ENS Cachan: MVA Master (Mathematics, Vision and Learning)
• INRIA Parietal team: PhD in Computer Science, "Understanding the visual cortex by using classification techniques"
Experience
• Logilab: development and data science consulting; Data.bnf.fr (French National Library open-data platform); Brainomics (platform for heterogeneous medical data)
• Rakuten PriceMinister: senior developer and data scientist; data engineering and data science consulting
Software engineering
Lessons learned from (painful) experiences
Do not redo it yourself!
Lots of really interesting open-source libraries for all your needs:
• Test first on a small POC, then contribute / develop
• Scikit-learn, pandas, Caffe, scikit-image, OpenCV, …
• Be careful: it is really easy to do something wrong!
Open data:
• More and more open data for catalogs, …
• E.g. data.bnf.fr: ~2,000,000 authors, ~200,000 works, ~200,000 topics
Contribute to open source:
• Is there a need / a pool of potential developers?
• Do it well (documentation / tests)
• Unless you are doing some kind of super magical algorithm
• May bring you help, bug fixes, and engineers! But it takes time and energy
Quality in data science software engineering
Never underestimate integration cost:
• It is really easy to write 20 lines of Python code doing some fancy Random Forests…
• …that could be really hard to deploy (data pipeline, packaging, monitoring)
• Developer != DevOps != Sys admin
Make it clean from the start (> 2 days of dev or > 100 lines of code):
• Tests, tests, tests, tests, tests, tests, tests, …
• Documentation
• Packaging / supervision / monitoring
• Release early, release often
• Agile development, pull requests, code versioning
Choose the right tool:
• Do you really need this super fancy NoSQL database to store your transactions?
Monitoring and metrics
Always monitor:
• Your development: continuous integration (Jenkins)
• Your service: Nagios / Shinken
• Your business data (BI): Kibana
• Your users: tracker
• Your data science process: e.g. A/B test
Evaluation:
• Choose the right metric
• Prediction accuracy / precision-recall, …
• Always A/B test rather than relying on personal thoughts
• A good question leads to a good answer: define your problem
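As a rough illustration of the two evaluation ideas above, the sketch below computes precision and recall for binary predictions and runs a two-sided two-proportion z-test on A/B conversion counts. The function names and the sample numbers are hypothetical; the slides do not say which metrics or tests are used in practice.

```python
from math import erf, sqrt


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (truthy = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test on conversion counts of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed with the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Made-up numbers, for illustration only.
precision, recall = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
z, p_value = ab_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
```

A small p-value only says the difference is unlikely to be noise; whether the metric being compared is the right one is still the "define your problem" question.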
Hiring remarks
Finding a good data scientist
Finding your data scientist
Do not try to find a unicorn!
Define your needs (and unicorns no longer exist…)
Few remarks on hiring – my personal opinion
Be careful of CVs with buzzwords!
• E.g. “IT skills: SVM (linear, non-linear), Clustering (K-means, Hierarchical), Random Forests, Regularization (L1, L2, Elastic net…) …”
• It is like someone saying “IT skills: Python (for loop, if/else pattern, …)”
• Often found in junior CVs (OK), but a huge warning in senior CVs
Hungry for data?
• Loving data is the most important thing to check
• Open data? Personal projects? Curious about data? (Hackathon?)
• Multidisciplinary == knowing how to handle various datasets
Check for IT skills:
• Should be able to install / develop new libraries / algorithms
• A huge part of the job could be to format / clean up the data
• Experience vs. education -> autonomy
Recommendations @Rakuten
Data science use-case
Rakuten Group Worldwide
Recommendation challenges
• Different languages
• User behavior
• Business areas
Rakuten Group in Numbers
Rakuten in Japan
• > 12,000 employees
• > 48 billion euros of GMS
• > 100,000,000 users
• > 250,000,000 items
• > 40,000 merchants
Rakuten Group
• Kobo: 18,000,000 users
• Viki: 28,000,000 users
• Viber: 345,000,000 users
Rakuten Ecosystem
Rakuten global ecosystem:
• Member-based business model that connects Rakuten services
• Rakuten ID common to various Rakuten services
• Online shopping and services
Main business areas
• E-commerce
• Internet finance
• Digital content
Recommendation challenges
• Cross-services
• Aggregated data
• Complex user features
Rakuten’s e-commerce: B2B2C Business Model
Business to Business to Consumer:
• Merchants located in different regions / online virtual shopping mall
Main profit sources
• Fixed fees from merchants
• Fees based on each transaction and other services
Recommendation challenges
• Many shops
• Item references
• Global catalog
Big Data Department @ Rakuten
Big Data Department
• 150+ engineers: Japan / Europe / US
Missions
Development and operations of internal systems for:
• Recommendations
• Search
• Targeting
• User behavior tracking
Average traffic
• > 100,000,000 events / day
• > 40,000,000 item views / day
• > 50,000,000 searches / day
• > 750,000 purchases / day
Technology stack
• Java / Python / Ruby
• Solr / Lucene
• Cassandra / Couchbase
• Hadoop / Hive / Pig
• Redis / Kafka
Recommendations on Rakuten Marketplaces
Non-personalized recommendations
• All-shop recommendations: item to item, user to item
• In-shop recommendations
• Review-based recommendations
Personalized recommendations
• Purchase history recommendations
• Cart add recommendations
• Order confirmation recommendations
System status and scale
• In production in over 35 services of Rakuten Group worldwide
• Several hundred servers running Hadoop, Cassandra and APIs
Challenges in Recommendations
Items catalogue
• Catalogue for multiple shops with different item references?
Items similarity / distances
• Cross-services aggregation?
• Lots of parameters?
Recommendations engine
• Best / optimal recommendations logic?
Evaluation process
• Offline / online evaluation?
• Long tail? KPI?
Recommendations Architecture: Constantly Evolving
[Architecture diagram. Components: browsing events, purchase events, co-counts storage, catalogue(s), offline / materialized recommendations, online algebra / multi-arm recommendations, distribution layer.]
Items Catalogues
Use different levels of aggregation to improve recommendations
• Category level (e.g. food, soda, clothes, …)
• Product level (manufactured items)
• Item-in-shop level (a specific product sold by a specific shop)
Increased statistical power in co-events computation
Easier business handling (picking the right item)
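To make the "statistical power" point concrete, here is a minimal sketch that rolls item-in-shop co-counts up to the product level. The mapping `ITEM_TO_PRODUCT`, the identifiers and the counts are all hypothetical; the slides do not describe how the real catalogue mapping is stored.

```python
from collections import Counter

# Hypothetical mapping from item-in-shop IDs to product IDs.
ITEM_TO_PRODUCT = {
    "shopA:cola-33cl": "cola-33cl",
    "shopB:cola-33cl": "cola-33cl",
    "shopA:chips": "chips",
    "shopC:chips": "chips",
}


def aggregate_cocounts(item_cocounts, item_to_product):
    """Sum item-in-shop co-counts at the product level (unordered pairs)."""
    product_cocounts = Counter()
    for (item_a, item_b), count in item_cocounts.items():
        prod_a, prod_b = item_to_product[item_a], item_to_product[item_b]
        if prod_a == prod_b:
            continue  # same product sold by two shops: not a useful pair
        product_cocounts[tuple(sorted((prod_a, prod_b)))] += count
    return product_cocounts


# Each shop-level pair was seen only once...
item_cocounts = {
    ("shopA:cola-33cl", "shopA:chips"): 1,
    ("shopB:cola-33cl", "shopC:chips"): 1,
    ("shopA:cola-33cl", "shopB:cola-33cl"): 2,
}
# ...but at the product level the cola / chips pair is seen twice.
product_cocounts = aggregate_cocounts(item_cocounts, ITEM_TO_PRODUCT)
```

The same roll-up could be applied once more, from products to categories, at the cost of less specific recommendations.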
Enriching Catalogues using Record Linkage
[Diagram: Marketplace 1, Marketplace 2, reference database]
Record linkage
• Use external sources (e.g., Wikidata) to align markets' products
• Fuzzy matching of 600K vs. 350K items for the movies alignment use case
• Blocking algorithm
Cross recommendation
• Global catalog
• Items aggregation
• Helps with cold start issues
• Improved navigation
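The combination of blocking and fuzzy matching can be sketched as follows, using only the standard library. The blocking key (release year), the similarity threshold and the record fields are assumptions made for this example; the slides do not say which key, matcher or threshold is actually used.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace before comparing titles."""
    return " ".join(title.lower().split())


def block_key(record):
    # Hypothetical blocking key: only records sharing a release year are compared.
    return record["year"]


def link_records(left, right, threshold=0.8):
    """Link each left record to its best fuzzy title match within the same block."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[block_key(rec)].append(rec)
    links = []
    for rec in left:
        best, best_score = None, 0.0
        for cand in blocks.get(block_key(rec), []):
            score = SequenceMatcher(
                None, normalize(rec["title"]), normalize(cand["title"])
            ).ratio()
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            links.append((rec["id"], best["id"]))
    return links


# Toy records, for illustration only.
marketplace_1 = [
    {"id": "m1-1", "title": "The Matrix", "year": 1999},
    {"id": "m1-2", "title": "Amelie", "year": 2001},
]
marketplace_2 = [
    {"id": "m2-7", "title": "the  matrix", "year": 1999},
    {"id": "m2-8", "title": "The Matrix Reloaded", "year": 2003},
    {"id": "m2-9", "title": "Amélie", "year": 2001},
]
links = link_records(marketplace_1, marketplace_2)
```

Blocking keeps the number of comparisons far below the full 600K x 350K cross product, at the risk of missing matches whose blocking key is wrong or absent.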
Co-occurrences and Similarities Computation
Only access to unitary data (purchase / browsing)
Use co-occurrences for computing item similarity
Multiple possible parameters:
• Size of the time window to be considered: do browsing and purchase data reflect similar behavior?
• Threshold on co-occurrences: is one co-occurrence significant enough to be used? Two? Three?
• Symmetric or asymmetric: is the order important in the co-occurrence? A then B == B then A?
• Similarity metrics: which similarity metric should be used based on the co-occurrences?
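The four parameters above can be made explicit in a small sketch: a time window, a symmetric / asymmetric switch, a minimum co-occurrence count, and a similarity metric. The cosine-style score, the function names and the toy events are assumptions for illustration, not the metric or data used in production.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from math import sqrt


def cooccurrences(events, window, symmetric=True):
    """Count item pairs seen for the same user within `window`."""
    by_user = defaultdict(list)
    for user, item, ts in events:
        by_user[user].append((ts, item))
    counts = Counter()
    for history in by_user.values():
        history.sort()  # chronological order
        for i, (ts_a, item_a) in enumerate(history):
            for ts_b, item_b in history[i + 1:]:
                if ts_b - ts_a > window:
                    break  # later events are even further away
                if item_a == item_b:
                    continue
                pair = (item_a, item_b)  # "A then B"
                if symmetric:
                    pair = tuple(sorted(pair))  # A then B == B then A
                counts[pair] += 1
    return counts


def similarities(counts, item_freq, min_count=2):
    """Cosine-style score, keeping only pairs seen at least `min_count` times."""
    return {
        pair: n / sqrt(item_freq[pair[0]] * item_freq[pair[1]])
        for pair, n in counts.items()
        if n >= min_count
    }


# Toy (user, item, timestamp) events, for illustration only.
events = [
    ("u1", "A", datetime(2015, 11, 7, 10, 0)),
    ("u1", "B", datetime(2015, 11, 7, 12, 0)),
    ("u1", "C", datetime(2015, 11, 24, 9, 0)),
    ("u2", "B", datetime(2015, 11, 8, 9, 0)),
    ("u2", "A", datetime(2015, 11, 8, 9, 30)),
    ("u3", "A", datetime(2015, 9, 8, 9, 0)),
    ("u3", "C", datetime(2015, 9, 10, 9, 0)),
]
item_freq = Counter(item for _, item, _ in events)
sym = cooccurrences(events, window=timedelta(days=1))
asym = cooccurrences(events, window=timedelta(days=1), symmetric=False)
scores = similarities(sym, item_freq)
```

With a one-day window, the symmetric count for (A, B) is 2, while the asymmetric variant splits it into one "A then B" and one "B then A", which would both fall under a threshold of two.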
Co-occurrences Example
[Figure: timeline of browsing and purchase events (dated 08/09/2015 to 24/11/2015), grouped by session or by time window (time window 1, time window 2).]
Co-occurrences Computation
Classical co-occurrences
• Co-purchases: complementary items (“You may also want…”)
• Co-browsing: substitute items (“Similar items…”)
Other possible co-occurrences
• Items browsed and bought together
• Items browsed and not bought together
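One possible reading of the two "other" co-occurrence types is sketched below: within a session, a pair of browsed items counts as "bought together" when both were purchased, and as "not bought together" otherwise. This interpretation, the session format and the item names are assumptions; the slides do not define these counts precisely.

```python
from collections import Counter
from itertools import combinations


def split_cooccurrences(sessions):
    """Split co-browsed pairs by whether both items were also purchased.

    `sessions` is an iterable of (browsed_items, purchased_items) sets.
    """
    bought_together = Counter()
    not_bought_together = Counter()
    for browsed, purchased in sessions:
        for a, b in combinations(sorted(browsed), 2):
            if a in purchased and b in purchased:
                bought_together[(a, b)] += 1
            else:
                not_bought_together[(a, b)] += 1
    return bought_together, not_bought_together


# Toy sessions, for illustration only.
sessions = [
    ({"tv", "tv-b", "hdmi-cable"}, {"tv", "hdmi-cable"}),
    ({"tv", "tv-b"}, {"tv-b"}),
]
bought_together, not_bought_together = split_cooccurrences(sessions)
```

Under this reading, pairs that are often browsed together but rarely bought together (the two TVs here) look like substitutes, while pairs bought together (TV and cable) look like complements.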
Recommendation Quality Challenges
Recommendation categories
Cold start issue
• External data?
• Cross-services?
Hot products (A)
• Top-N items?
Short tail (B)
Long tail (C + D)
[Figure: products placed on two axes, minor vs. major (popular) product and new vs. old product, with regions labeled (A), (B), (C) and (D).]
Long Tail is Fat
Long tail numbers
• Most of the items are long tail
• They still represent a large portion of the traffic
Long tail approaches
• Content-based
• Aggregation / clustering
• Personalization
[Figure: browsing share and number of items for popular, short-tail and long-tail items.]
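The "long tail is fat" observation can be checked on any view log with a few lines. The sketch below ranks items by view count, cuts the ranking into three segments, and reports each segment's share of items and of browsing. The rank cut-offs and the view counts are invented for the example and are not Rakuten figures.

```python
def tail_shares(view_counts, popular_n, short_tail_n):
    """Share of items and of browsing for popular / short-tail / long-tail items.

    Items are ranked by views; the two rank cut-offs are hypothetical parameters.
    """
    ranked = sorted(view_counts.values(), reverse=True)
    total_views = sum(ranked)
    segments = {
        "popular": ranked[:popular_n],
        "short tail": ranked[popular_n:popular_n + short_tail_n],
        "long tail": ranked[popular_n + short_tail_n:],
    }
    return {
        name: {"items": len(views) / len(ranked), "browsing": sum(views) / total_views}
        for name, views in segments.items()
    }


# Toy view counts: 1 popular item, 2 short-tail items, 10 long-tail items.
view_counts = {"hit": 300, "s1": 100, "s2": 100}
view_counts.update({f"t{i}": 50 for i in range(10)})
shares = tail_shares(view_counts, popular_n=1, short_tail_n=2)
```

In this toy data the long tail holds 10 of 13 items and still accounts for half of the browsing.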
Recommendations Offline Evaluation
Pros / cons
• Convenient way to try new ideas
• Fast and cheap
• But hard to align with online KPI
Approaches
• Rescoring
• Prediction game
• Business simulator
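One common way to run a "prediction game" offline is to hide the last item of each user history and check whether the recommender would have proposed it. The sketch below measures a hit rate at k under that assumption; the protocol, the `recommend` callback and the toy recommender are hypothetical and may differ from what the slide refers to.

```python
def hit_rate_at_k(histories, recommend, k=3):
    """Fraction of histories whose hidden last item is in the top-k recommendations.

    `recommend(context)` is any callable returning a ranked list of items.
    """
    hits, trials = 0, 0
    for items in histories:
        if len(items) < 2:
            continue  # nothing to hide or nothing to recommend from
        *context, hidden = items
        trials += 1
        if hidden in recommend(context)[:k]:
            hits += 1
    return hits / trials if trials else 0.0


# Toy item-to-item recommender keyed on the last seen item (illustration only).
TOY_RECOMMENDATIONS = {"A": ["B", "E"], "B": ["C"]}


def toy_recommend(context):
    return TOY_RECOMMENDATIONS.get(context[-1], [])


histories = [["A", "B"], ["A", "C"], ["B", "D"], ["C"]]
rate = hit_rate_at_k(histories, toy_recommend)
```

Such a score is fast and cheap to compute, but a higher hit rate does not guarantee a better online KPI, which is the alignment problem noted above.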
Public Initiative – Viki Recommendation Challenge
567 submissions from 132 participants
http://www.dextra.sg/challenges/rakuten-viki-video-challenge
Data science everywhere!
Rakuten provides marketplaces worldwide
Specific challenges for recommendations
Items catalogue: reinforce statistical power of co-occurrences across shops and services;
Items similarities: find good parameters for the different use-cases;
Recommendations models: what are the best models for in-shop, all-shops, personalization?
Evaluation: handling long-tail? Comparing different models?
THANKS!
Questions?
More on Rakuten tech initiatives
• http://www.slideshare.net/rakutentech
• http://rit.rakuten.co.jp/oss.html
• http://rit.rakuten.co.jp/opendata.html
Positions
• http://global.rakuten.com/corp/careers/bigdata/
• http://www.priceminister.com/recrutement/?p=197
We are Hiring!
Big Data Department: team in Paris
• http://global.rakuten.com/corp/careers/bigdata/
• http://www.priceminister.com/recrutement/?p=197
Data Scientist / Software Developer
• Build algorithms for recommendations, search, targeting
• Predictive modeling, machine learning, natural language processing
• Working close to business
• Python, Java, Hadoop, Couchbase, Cassandra…
Also hiring: search engine developers, big data system administrators, etc.