Telecom datascience master_public
Page 1: Telecom datascience master_public

Data Science in E-commerce industry
Telecom Paris - Séminaires Big Data - 2016/06/09
Vincent Michel

Big Data Europe, BDD, Rakuten Inc. / PriceMinister

[email protected] @HowIMetYourData

Page 2: Telecom datascience master_public

2

Short Bio

Education
• ESPCI: engineer in Physics / Biology
• ENS Cachan: MVA Master (Mathematics, Vision and Learning)
• INRIA, Parietal team: PhD in Computer Science - understanding the visual cortex using classification techniques

Experience
• Logilab: development and data science consulting - data.bnf.fr (the French National Library open-data platform), Brainomics (a platform for heterogeneous medical data)
• Rakuten PriceMinister: Senior Developer and Data Scientist - data engineering and data science consulting

Page 3: Telecom datascience master_public

Software engineering
Lessons learned from (painful) experiences

Page 4: Telecom datascience master_public

4

Do not redo it yourself!

Lots of interesting open-source libraries for all your needs:
• Test first on a small POC, then contribute/develop
• Scikit-learn, pandas, Caffe, Scikit-image, OpenCV, …
• Be careful: it is easy to do something wrong!

Open data:
• More and more open data for catalogs, …
• E.g. data.bnf.fr: ~2.000.000 authors, ~200.000 works, ~200.000 topics

Contribute to open source:
• Is there a need / a pool of potential developers?
• Do it well (documentation / tests), unless you are doing some kind of super magical algorithm
• May bring you help, bug fixes, and engineers! But it takes time and energy

Page 5: Telecom datascience master_public

5

Quality in data science software engineering

Never underestimate integration cost:
• It is easy to write 20 lines of Python doing some fancy Random Forests…
• …that can be hard to deploy (data pipeline, packaging, monitoring)
• Developer != DevOps != Sys admin

Make it clean from the start (> 2 days of dev or > 100 lines of code):
• Tests, tests, tests, tests, tests, tests, tests, …
• Documentation
• Packaging / supervision / monitoring
• Release early, release often
• Agile development, pull requests, code versioning

Choose the right tool:
• Do you really need this super fancy NoSQL database to store your transactions?

Page 6: Telecom datascience master_public

6

Monitoring and metrics

Always monitor:
• Your development: continuous integration (Jenkins)
• Your service: Nagios / Shinken
• Your business data (BI): Kibana
• Your users: tracker
• Your data science process: e.g. A/B tests

Evaluation:
• Choose the right metric: prediction accuracy, precision-recall, …
• Always A/B test rather than relying on personal opinion (see the sketch below)
• Good questions lead to good answers: define your problem
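As a toy illustration of the A/B testing point (not from the deck; all counts are made-up placeholders), a two-sided two-proportion z-test on conversion rates could be computed as follows:

# Hedged sketch: two-proportion z-test for an A/B test on conversion rate.
# All counts below are illustrative placeholders.
from math import sqrt
from scipy.stats import norm

conv_a, visits_a = 1150, 50000   # control
conv_b, visits_b = 1260, 50000   # variant

p_a, p_b = conv_a / visits_a, conv_b / visits_b
p_pool = (conv_a + conv_b) / (visits_a + visits_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visits_a + 1 / visits_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))    # two-sided p-value

print("lift: %+.4f, z = %.2f, p-value = %.4f" % (p_b - p_a, z, p_value))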

Page 7: Telecom datascience master_public

Hiring remarks
Selling yourself as a (good) data scientist

Page 8: Telecom datascience master_public

8

Few remarks on hiring – my personal opinion

Be careful of CVs full of buzzwords!
E.g. “IT skills: SVM (linear, non-linear), Clustering (K-means, hierarchical), Random Forests, Regularization (L1, L2, Elastic net…), …”
It is like someone writing “IT skills: Python (for loops, if/else, …)”

Often found in Junior CVs (ok), but huge warning in Senior CVs

Hungry for data?
• Loving data is the most important thing to check
• Open data? Personal projects? Curious about data? (Hackathons?)
• Pluridisciplinary == knowing how to handle various datasets

Check for IT skills:
• Should be able to install/develop new libraries/algorithms
• A huge part of the job can be formatting / cleaning up the data
• Experience vs. education -> autonomy

Page 9: Telecom datascience master_public

Recommendations @Rakuten
Data science use-case

Page 10: Telecom datascience master_public

10

Rakuten Group Worldwide

Recommendation challenges:
• Different languages
• User behavior
• Business areas

Page 11: Telecom datascience master_public

11

Rakuten Group in Numbers

Rakuten in Japan

• > 12.000 employees
• > 48 billion euros of GMS
• > 100.000.000 users
• > 250.000.000 items
• > 40.000 merchants

Rakuten Group

• Kobo: 18.000.000 users
• Viki: 28.000.000 users
• Viber: 345.000.000 users

Page 12: Telecom datascience master_public

12

Rakuten Ecosystem

Rakuten global ecosystem:
• Member-based business model that connects Rakuten services
• A Rakuten ID common to the various Rakuten services
• Online shopping and services

Main business areas:
• E-commerce
• Internet finance
• Digital content

Recommendation challenges:
• Cross-services
• Aggregated data
• Complex user features

Page 13: Telecom datascience master_public

13

Rakuten’s e-commerce: B2B2C Business Model

Business to Business to Consumer:
• Merchants located in different regions / online virtual shopping mall
• Main profit sources:
  • Fixed fees from merchants
  • Fees based on each transaction and other services

Recommendation challenges:
• Many shops
• Item references
• Global catalog

Page 14: Telecom datascience master_public

14

Big Data Department @ Rakuten

Big Data Department: 150+ engineers - Japan / Europe / US

Missions

Development and operations of internal systems for:
• Recommendations
• Search
• Targeting
• User behavior tracking

Average traffic
• > 100.000.000 events / day
• > 40.000.000 item views / day
• > 50.000.000 searches / day
• > 750.000 purchases / day

Technology stack
• Java / Python / Ruby
• Solr / Lucene
• Cassandra / Couchbase
• Hadoop / Hive / Pig
• Redis / Kafka

Page 15: Telecom datascience master_public

15

Recommendations on Rakuten Marketplaces

Non-personalized recommendations:
• All-shop recommendations: item-to-item, user-to-item
• In-shop recommendations
• Review-based recommendations

Personalized recommendations:
• Purchase history recommendations
• Cart add recommendations
• Order confirmation recommendations

System status and scale:
• In production in over 35 services of the Rakuten Group worldwide
• Several hundreds of servers running Hadoop, Cassandra, and APIs

Page 16: Telecom datascience master_public

Recommendations
The big picture

Page 17: Telecom datascience master_public

17

Challenges in Recommendations

Four building blocks: items catalogue, items similarity, recommendations engine, evaluation process.

Items catalogues
• Catalogue for multiple shops with different item references?

Items similarity / distances
• Cross-services aggregation?
• Lots of parameters?

Recommendations engine
• Best / optimal recommendations logic?

Evaluation process
• Offline / online evaluation?
• Long tail? KPIs?

Page 18: Telecom datascience master_public

18

Recommendations Architecture: Constantly Evolving

(Architecture diagram: browsing events and purchase events feed the co-counts storage, together with the catalogue(s); a distribution layer serves two kinds of recommendations.)

Recommendations: offline / materialized

Recommendations: online algebra / multi-arm

Page 19: Telecom datascience master_public

19

Items Catalogues

Use different levels of aggregation to improve recommendations

• Category level (e.g. food, soda, clothes, …)
• Product level (manufactured items)
• Item-in-shop level (a specific product sold by a specific shop)

Benefits: increased statistical power in co-events computation, and easier business handling (picking the right item). A rollup sketch follows below.
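As an illustration only (the mappings and counts are hypothetical, not taken from the deck), rolling item-in-shop co-counts up to product level could look like this:

# Hedged sketch: aggregate item-in-shop co-counts to product level.
from collections import Counter

item_to_product = {"shopA:123": "P1", "shopB:987": "P1", "shopA:456": "P2"}
cocounts = Counter({("shopA:123", "shopA:456"): 3, ("shopB:987", "shopA:456"): 2})

product_cocounts = Counter()
for (i, j), c in cocounts.items():
    pi, pj = item_to_product[i], item_to_product[j]
    if pi != pj:                         # skip co-counts within the same product
        product_cocounts[(pi, pj)] += c  # pooling events increases statistical power

print(product_cocounts)                  # Counter({('P1', 'P2'): 5})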

Page 20: Telecom datascience master_public

20

Enriching Catalogues using Record Linkage

(Diagram: items from Marketplace 1 and Marketplace 2 are aligned against a reference database.)

Record linkage:
• Use external sources (e.g. Wikidata) to align the marketplaces' products
• Fuzzy matching of 600K vs 350K items for the movies alignment use-case
• Blocking algorithm to keep the number of comparisons tractable (see the sketch below)

Cross recommendations:
• Global catalog / items aggregation
• Helps with cold-start issues
• Improved navigation
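The deck does not detail the matching pipeline; as a minimal sketch of the blocking + fuzzy-matching idea (the blocking key, threshold, and data are assumptions), using only the standard library:

# Hedged sketch: record linkage with blocking on a cheap key (here, release year),
# then fuzzy title matching inside each block.
from collections import defaultdict
from difflib import SequenceMatcher

catalog_a = [("a1", "Terminator 2: Judgment Day", 1991), ("a2", "Alien", 1979)]
catalog_b = [("b1", "Terminator 2 - Judgment Day", 1991), ("b2", "Aliens", 1986)]

blocks = defaultdict(list)               # blocking: only compare items sharing a year
for item_id, title, year in catalog_b:
    blocks[year].append((item_id, title))

matches = []
for item_id, title, year in catalog_a:
    for other_id, other_title in blocks.get(year, []):
        score = SequenceMatcher(None, title.lower(), other_title.lower()).ratio()
        if score > 0.85:                 # assumed similarity threshold
            matches.append((item_id, other_id, round(score, 2)))

print(matches)                           # aligned id pairs with their similarity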

Page 21: Telecom datascience master_public

21

Semantic-web and RDF format

Triples: <subject> <relation> <object>
URI: unique identifier

http://dbpedia.org/page/Terminator_2:_Judgment_Day
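As a small illustration (not part of the deck), the triple structure with DBpedia-style URIs can be written with rdflib:

# Hedged sketch: RDF triples <subject> <relation> <object> with rdflib.
from rdflib import Graph, Namespace

DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
g.add((DBR["Terminator_2:_Judgment_Day"], DBO.director, DBR["James_Cameron"]))
g.add((DBR["Terminator_2:_Judgment_Day"], DBO.starring, DBR["Arnold_Schwarzenegger"]))

for subj, rel, obj in g:                 # each entry is one <s> <p> <o> triple
    print(subj, rel, obj)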

Page 22: Telecom datascience master_public

Recommendations
Co-counts and matrices

Page 23: Telecom datascience master_public

23

Recommendation datatypes

Ratings
• Numerical feedback from the users
• Sources: stars, reviews, …
✔ Qualitative and valuable data
✖ Hard to obtain; scaling and normalization needed!
(Figure: sparse users × items matrix of numerical ratings.)

Unitary data
• Only 0/1, without any quality feedback
• Sources: clicks, purchases, …
✔ Easy to obtain (e.g. from a tracker)
✖ No direct rating
(Figure: sparse users × items matrix of 0/1 events.)

Page 24: Telecom datascience master_public

24

Collaborative filtering

User-user approach (#items < #users, or items are changing quickly):
1 - Compute user similarities (cosine similarity, Pearson correlation)
2 - Predict the missing rating ("?") as a similarity-weighted average of the other users' ratings
(Figure: sparse users × items ratings matrix with one unknown entry "?" to predict; see the sketch below.)

Item-item approach (#items >> #users): same idea, based on item-item similarities.
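A minimal user-user sketch on a toy dense matrix (zeros stand for unknown ratings; a real system would use sparse data), assuming cosine similarity and a similarity-weighted average:

# Hedged sketch: user-user collaborative filtering on a toy ratings matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, columns = items, 0 = unknown rating
R = np.array([[1., 3., 0., 2., 0.],
              [0., 5., 2., 0., 0.],
              [2., 4., 0., 1., 0.],
              [3., 0., 1., 0., 5.],
              [4., 4., 1., 0., 3.]])

sim = cosine_similarity(R)               # 1 - user-user similarities

def predict(user, item):
    """2 - similarity-weighted average of the other users' known ratings."""
    rated = R[:, item] > 0
    rated[user] = False
    weights = sim[user, rated]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ R[rated, item] / weights.sum())

print(round(predict(0, 2), 2))           # predicted rating of user 0 for item 2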

Page 25: Telecom datascience master_public

25

Matrix factorization

(Figure: the users × items ratings matrix is approximated by the product of a users × latent-factors matrix and a latent-factors × items matrix, R ~ U x V.)

• Choose a number of latent variables to decompose the data
• Predict a new rating using the dot product of the latent vectors
• Use gradient descent techniques (e.g. SGD)
• Add some regularization (see the objective below)
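Written out, the objective sketched on this slide (user vectors $u_k$, item vectors $v_i$, observed ratings $\mathcal{O}$) is typically:

$$\min_{U,V} \; \sum_{(k,i) \in \mathcal{O}} \bigl( r_{ki} - u_k^\top v_i \bigr)^2 \;+\; \lambda \bigl( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \bigr), \qquad \hat r_{ki} = u_k^\top v_i,$$

minimized by stochastic gradient descent over the observed entries.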

Page 26: Telecom datascience master_public

26

Matrix factorization – MovieLens example

Read files

import csv

movies_fname = '/path/ml-latest/movies.csv'
with open(movies_fname) as fobj:
    next(fobj)  # skip the CSV header
    # movieId -> title
    movies = dict((r[0], r[1]) for r in csv.reader(fobj))

ratings_fname = '/path/ml-latest/ratings.csv'
with open(ratings_fname) as fobj:
    next(fobj)  # skip the CSV header
    # (userId, movie title, rating)
    ratings = [(r[0], movies[r[1]], float(r[2])) for r in csv.reader(fobj)]

Build sparse matrix

import scipy.sparse as sp

user_idx, item_idx = {}, {}
data, rows, cols = [], [], []
for u, i, s in ratings:
    rows.append(user_idx.setdefault(u, len(user_idx)))
    cols.append(item_idx.setdefault(i, len(item_idx)))
    data.append(s)
ratings = sp.csr_matrix((data, (rows, cols)))
reverse_item_idx = dict((v, k) for k, v in item_idx.items())
reverse_user_idx = dict((v, k) for k, v in user_idx.items())

Page 27: Telecom datascience master_public

27

Matrix factorization – MovieLens example

Fit a Non-negative Matrix Factorization

from sklearn.decomposition import NMF

nmf = NMF(n_components=50)
user_mat = nmf.fit_transform(ratings)   # users x components
item_mat = nmf.components_              # components x items

Print the results

component_ind = 3
component = [(reverse_item_idx[i], s)
             for i, s in enumerate(item_mat[component_ind, :]) if s > 0.]
for movie, score in sorted(component, key=lambda x: x[1], reverse=True)[:10]:
    print(movie, round(score))

Example components (top movies and scores):

Terminator 2: Judgment Day (1991) 24.0
Terminator, The (1984) 23.0
Die Hard (1988) 19.0
Aliens (1986) 17.0
Alien (1979) 16.0

Exorcist, The (1973) 8.0
Halloween (1978) 7.0
Nightmare on Elm Street, A (1984) 7.0
Shining, The (1980) 7.0
Carrie (1976) 7.0

Star Trek II: The Wrath of Khan (1982) 10.0
Star Trek: First Contact (1996) 10.0
Star Trek IV: The Voyage Home (1986) 9.0
Contact (1997) 8.0
Star Trek VI: The Undiscovered Country (1991) 8.0
Blade Runner (1982) 8.0

Page 28: Telecom datascience master_public

28

Binary / Unitary data

Only occurrences of item views / purchases / …

Jaccard distance

Cosine similarity

Conditional probability
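With $A$ and $B$ the sets of users who viewed or purchased items $a$ and $b$, these measures are commonly written as:

$$\mathrm{Jaccard}(a,b) = \frac{|A \cap B|}{|A \cup B|}, \qquad \mathrm{cosine}(a,b) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}}, \qquad P(b \mid a) = \frac{|A \cap B|}{|A|}.$$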

Page 29: Telecom datascience master_public

29

Co-occurrences and Similarities Computation

Only access to unitary data (purchase / browsing)

Use co-occurrences for computing items similarity

Multiple possible parameters (see the sketch below):

Size of the time window to be considered
• Do browsing and purchase data reflect similar behavior?

Threshold on co-occurrences
• Is one co-occurrence significant enough to be used? Two? Three?

Symmetric or asymmetric
• Is the order important in the co-occurrence? Is "A then B" the same as "B then A"?

Similarity metric
• Which similarity metric should be used on top of the co-occurrences?
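A minimal sketch of the counting step, with the time window, threshold, and symmetry appearing as explicit parameters (the event data and field layout are hypothetical):

# Hedged sketch: symmetric item co-occurrence counts within a per-user time window.
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations

window = timedelta(days=1)               # parameter: time window
min_count = 2                            # parameter: threshold on co-occurrences

# (user, item, timestamp) events - illustrative data only
events = [("u1", "A", datetime(2015, 11, 7)), ("u1", "B", datetime(2015, 11, 8)),
          ("u2", "A", datetime(2015, 11, 8)), ("u2", "B", datetime(2015, 11, 8)),
          ("u2", "C", datetime(2015, 9, 10))]

by_user = {}
for user, item, ts in events:
    by_user.setdefault(user, []).append((ts, item))

cocounts = Counter()
for user_events in by_user.values():
    for (t1, i1), (t2, i2) in combinations(sorted(user_events), 2):
        if i1 != i2 and t2 - t1 <= window:          # symmetric: order is ignored
            cocounts[frozenset((i1, i2))] += 1

kept = {pair: c for pair, c in cocounts.items() if c >= min_count}
print(kept)                              # {frozenset({'A', 'B'}): 2}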

Page 30: Telecom datascience master_public

30

Co-occurrences Example

(Figure: browsing and purchase timelines for one user, with example event dates in September and November 2015, illustrating different choices of sessions and time windows.)

Page 31: Telecom datascience master_public

31

Co-occurrences Computation

Classical co-occurrences:
• Co-purchases -> complementary items ("You may also want…")
• Co-browsing -> substitute items ("Similar items…")

Other possible co-occurrences:
• Items browsed and bought together
• Items browsed and not bought together

(Figure: example browsing and purchase timelines with event dates.)

Page 32: Telecom datascience master_public

Recommendations
Development and evaluation

Page 33: Telecom datascience master_public

33

Recommendations Algebra

Algebra for defining and combining recommendations engines

Key ideas:
• Reuse already existing logics and combine them easily
• Write business logic, not code!
• Handle multiple input/output formats

Available logics:
• Content-based
• Collaborative filtering
• Item-item
• User-item (personalization)

Available backends:
• In-memory
• HDF5 files
• Cassandra
• Couchbase

Available hybridizations:
• Linear algebra / weighting
• Mixed
• Cascade engines
• Meta-level

A toy sketch of the algebra idea follows below.
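The deck does not show how the algebra is implemented internally; as a rough illustration only (class names and scoring are hypothetical, and only the linear weighting hybridization is covered), overloading + and * on engine objects could look like this:

# Hedged sketch: a toy "recommendations algebra" via operator overloading.
class Engine:
    """Returns {recommended item: score} for a list of seed items."""
    def __init__(self, scores_by_item):
        self.scores_by_item = scores_by_item          # toy precomputed backend

    def recommendations_by_items(self, items):
        combined = {}
        for item in items:
            for reco, score in self.scores_by_item.get(item, {}).items():
                combined[reco] = combined.get(reco, 0.0) + score
        return combined

    def __mul__(self, weight):                        # engine * 0.2 / 0.2 * engine
        return Weighted(self, weight)
    __rmul__ = __mul__

    def __add__(self, other):                         # engine1 + engine2
        return Combined(self, other)

class Weighted(Engine):
    def __init__(self, engine, weight):
        self.engine, self.weight = engine, weight
    def recommendations_by_items(self, items):
        return {k: self.weight * v
                for k, v in self.engine.recommendations_by_items(items).items()}

class Combined(Engine):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def recommendations_by_items(self, items):
        scores = self.left.recommendations_by_items(items)
        for k, v in self.right.recommendations_by_items(items).items():
            scores[k] = scores.get(k, 0.0) + v
        return scores

# Usage, mirroring the algebra shown on the next slides:
purchase = Engine({123: {456: 0.9, 789: 0.4}})
browsing = Engine({123: {456: 0.2, 111: 0.8}})
composite = purchase + 0.2 * browsing
print(composite.recommendations_by_items([123]))  # approx. {456: 0.94, 789: 0.4, 111: 0.16}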

Page 34: Telecom datascience master_public

34

Python Algebra Example

(Diagram: a purchase-based, top-20, asymmetric conditional-probability engine is combined with 0.2 times a browsing-based, symmetric cosine-similarity engine with a 0.01 similarity threshold, giving a composite engine.)

>>> engine1 = RecommendationsEngine(nb_recos=20, datatype='purchase',
...                                 asymmetric=True, distance='conditional_probability')
>>> engine2 = RecommendationsEngine(similarity_th=0.01, datatype='browsing',
...                                 asymmetric=False, distance='cosine_similarity')
>>> composite_engine = engine1 + 0.2 * engine2

Get recommendations from items (item-to-item)

>>> recos = composite_engine.recommendations_by_items([123, 456, 789, …])

Page 35: Telecom datascience master_public

35

Python Algebra with Personalization

(Diagram: the same composite engine as before, now fed by a purchase-history engine with a 180-day time window and a 0.01 time decay.)

>>> history = HistoryEngine(datatype='purchase', time_window=180, time_decay=0.01)
>>> engine1.register_history_engine(history)

…then the same code as previously, but user-to-item:

>>> recos = composite_engine.recommendations_by_user('userid')

Page 36: Telecom datascience master_public

36

Python Algebra – Complete Example

(Diagram: the previous composite engine - purchase-based top-20 asymmetric conditional probability, plus 0.2 times browsing-based symmetric cosine similarity with a 0.01 threshold, fed by a purchase-history engine with a 180-day window and 0.01 time decay - is cascaded (x) with a second, category-level composite engine: purchase-based asymmetric conditional probability with similarity > 0.01, plus 0.1 times browsing-based symmetric cosine similarity with similarity > 0.1.)

Page 37: Telecom datascience master_public

37

Recommendation Quality Challenges

Recommendation categories:
• Hot products (A): top-N items?
• Short tail (B)
• Long tail (C + D)
• Cold-start issue: external data? cross-services?

(Figure: quadrant of new vs. old products and minor vs. major (popular) products, split into regions A, B, C and D.)

Page 38: Telecom datascience master_public

38

Long Tail is Fat

Long tail numbers:
• Most of the items are in the long tail
• They still represent a large portion of the traffic

Long tail approaches:
• Content-based
• Aggregation / clustering
• Personalization

(Figure: browsing share vs. number of items for the popular, short-tail and long-tail segments.)

Page 39: Telecom datascience master_public

39

Recommendations Offline Evaluation

Pros / cons:
• A convenient way to try new ideas
• Fast and cheap
• But hard to align with online KPIs

Approaches:
• Rescoring
• Prediction game
• Business simulator

Page 40: Telecom datascience master_public

40

Public Initiative – Viki Recommendation Challenge

567 submissions from 132 participants
http://www.dextra.sg/challenges/rakuten-viki-video-challenge

Page 41: Telecom datascience master_public

41

Data science everywhere!

Rakuten provides marketplaces worldwide

Specific challenges for recommendations

Items catalogue: reinforce the statistical power of co-occurrences across shops and services

Items similarities: find the right parameters for the different use-cases

Recommendation models: what are the best models for in-shop, all-shop, and personalized recommendations?

Evaluation: how to handle the long tail? how to compare different models?

Page 43: Telecom datascience master_public

43

We are Hiring!

Big Data Department - team in Paris
http://global.rakuten.com/corp/careers/bigdata/
http://www.priceminister.com/recrutement/?p=197

Data Scientist / Software Developer

• Build algorithms for recommendations, search, targeting
• Predictive modeling, machine learning, natural language processing
• Working closely with the business
• Python, Java, Hadoop, Couchbase, Cassandra…

Also hiring: search engine developers, big data system administrators, etc.

