Date posted: 19-May-2015
Category: Technology
Uploaded by: maccman
Recommendations in Production
Alex MacCaw
Netflix Prize
Amazon.com
Facebook
Last.fm
StumbleUpon
Google Suggest
iTunes
Rotten Tomatoes
Yelp
Google Search
Chicken or Egg
• Google Reader
• IMDB
Acts As Recommendable
Types of recommendations
• Content Based
• User Based
• Item Based
Programming Collective Intelligence
Has Many Through Relationship
User has many UserBooks
Book has many UserBooks
User has many Books through UserBooks
UserBooks can have a score (rating)
User
class User < ActiveRecord::Base
  has_many :user_books
  has_many :books, :through => :user_books
  acts_as_recommendable :books, :through => :user_books
end
Gives you
User#similar_users
User#recommended_books
Book#similar_books
The algorithms
• Manhattan Distance
• Euclidean distance
• Cosine
• Pearson correlation coefficient
• Jaccard
• Levenshtein
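A few of the measures listed above can be sketched in plain Ruby. This is an illustrative sketch, not the plugin's code: the method names are made up here, the inputs are assumed to be equal-length arrays of numeric scores (or arrays of item ids for Jaccard), and Pearson and Levenshtein are left out for brevity.

```ruby
# Sum of absolute differences between paired scores.
def manhattan_distance(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

# Straight-line distance between two points in score space.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine of the angle between two score vectors (0.0 if either is all zeros).
def cosine_similarity(a, b)
  dot  = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  norm.zero? ? 0.0 : dot / norm
end

# Size of the intersection over size of the union of two id sets.
def jaccard_index(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end
```

Distances shrink as two users agree, while similarities grow, so a distance usually has to be inverted before it can be used for ranking.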
How does it work?
Strategy
• Map data into Euclidean Space
• Calculate similarity
• Use similarities to recommend
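The last step of the strategy, using similarities to recommend, might look roughly like the similarity-weighted average below. It is a sketch under assumptions: the method name `recommend`, the user-keyed `ratings` hash and the precomputed `similarities` hash are all hypothetical and not taken from the plugin.

```ruby
# ratings:      { user => { item => score } }
# similarities: { other_user => similarity to the target user }
# Returns [[item, predicted_score], ...] for items the user has not rated,
# best prediction first.
def recommend(ratings, similarities, user)
  seen    = ratings.fetch(user, {})
  totals  = Hash.new(0.0)
  weights = Hash.new(0.0)

  similarities.each do |other, sim|
    next if sim <= 0 # ignore users with no positive similarity
    ratings.fetch(other, {}).each do |item, score|
      next if seen.key?(item) # skip items the user already rated
      totals[item]  += sim * score
      weights[item] += sim
    end
  end

  totals.map { |item, total| [item, total / weights[item]] }
        .sort_by { |_, score| -score }
end
```

Dividing by the summed weights keeps an item from ranking highly just because many users rated it.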
         The Black Knight   John Tucker Must Die
James    4                  5
Jonah    3                  2
George   5                  3
Alex     4                  2
[Figure: users plotted by their ratings, with The Black Knight and John Tucker Must Die as the two axes, each running from 0 to 5.00]
Dataset: item id => { user id => score }
{ 1 => { 1 => 1.0, 2 => 0.0, ... }, ...}
[[1, 0.5554], [2, 0.888], [3, 0.8843], ...]
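To make the mapping concrete, the ratings of James, Jonah, George and Alex can be run through a Euclidean similarity to produce a ranked list of pairs like the one above. This is only a sketch: it keys the data by user name rather than by item id, the method names are invented for illustration, and the 1 / (1 + distance) conversion is one common way to turn a distance into a similarity, assumed here rather than confirmed as the plugin's choice.

```ruby
RATINGS = {
  "James"  => { "The Black Knight" => 4, "John Tucker Must Die" => 5 },
  "Jonah"  => { "The Black Knight" => 3, "John Tucker Must Die" => 2 },
  "George" => { "The Black Knight" => 5, "John Tucker Must Die" => 3 },
  "Alex"   => { "The Black Knight" => 4, "John Tucker Must Die" => 2 }
}

# Similarity in (0, 1]; 1.0 means identical ratings on the shared items.
def euclidean_similarity(prefs_a, prefs_b)
  shared = prefs_a.keys & prefs_b.keys
  return 0.0 if shared.empty?
  distance = Math.sqrt(shared.sum { |item| (prefs_a[item] - prefs_b[item])**2 })
  1.0 / (1.0 + distance)
end

# Returns [[other_user, similarity], ...], most similar first.
def similar_users(ratings, user)
  others = ratings.keys - [user]
  others.map { |other| [other, euclidean_similarity(ratings[user], ratings[other])] }
        .sort_by { |_, score| -score }
end
```

Calling `similar_users(RATINGS, "Alex")` ranks Jonah first, since the two differ by a single point on one film.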
Problem 1
It was far too slow to calculate on the fly (obviously)
SELECT * FROM "users" WHERE ("users"."id" = 2)
SELECT * FROM "books"
SELECT * FROM "users"
SELECT "user_books".* FROM "user_books" WHERE ("user_books".user_id IN (1,2,3,4,5,6,7,8,9,10))
SELECT * FROM "books" WHERE ("books"."id" IN (11,6,12,7,13,8,14,9,15,1,2,19,20,3,10,4,5))
SELECT * FROM "books" WHERE ("books"."id" IN (20,3,19,6))
Loads all books and all user_books
Solution
Cache the dataset
rake recommendations:build
Build offline
SELECT * FROM "user_books" WHERE ("user_books".user_id = 2)
SELECT * FROM "books" WHERE ("books"."id" = 5)
SELECT * FROM "books" WHERE ("books"."id" = 4)
SELECT * FROM "books" WHERE ("books"."id" = 8)
SELECT * FROM "books" WHERE ("books"."id" = 7)
SELECT * FROM "books" WHERE ("books"."id" = 2)
SELECT * FROM "books" WHERE ("books"."id" = 1)
Problem 2
Fetching the dataset took too long since it was so massive
Solution
Split up the cache by item
Rails.cache.write("aar_books_1", scores)
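Splitting the cache by item could be sketched as below, so that a lookup reads one small entry instead of the whole matrix. `MemoryCache` is a stand-in for `Rails.cache` so the example runs outside Rails, and `cache_key`, `write_similarities` and `read_similarities` are hypothetical helpers; only the "aar_books_1" key format comes from the slide.

```ruby
# Minimal in-memory stand-in for Rails.cache (assumption, for illustration).
class MemoryCache
  def initialize
    @store = {}
  end

  def write(key, value)
    @store[key] = value
  end

  def read(key)
    @store[key]
  end
end

# Builds keys shaped like "aar_books_1".
def cache_key(table_name, item_id)
  "aar_#{table_name}_#{item_id}"
end

# matrix: { item_id => [[other_item_id, similarity], ...] }
# Writes one cache entry per item.
def write_similarities(cache, table_name, matrix)
  matrix.each { |item_id, scores| cache.write(cache_key(table_name, item_id), scores) }
end

# Reads the entry for a single item; an uncached item yields an empty list.
def read_similarities(cache, table_name, item_id)
  cache.read(cache_key(table_name, item_id)) || []
end
```

With this layout, recommending for one book touches only that book's entry.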
Problem 3
The dataset was so big it crashed Ruby!
Solution
Get rid of ActiveRecord
Only deal with integers
items = options[:on_class].connection.select_values("SELECT id from #{options[:on_class].table_name}").collect(&:to_i)
Problem 4
It still crashed Ruby!
{ 1 => { 1 => 1.0, 2 => 0.0, ... }, ...}
Solution
Remove unnecessary cruft from dataset
{ 1 => { 1 => 1.0, ... }, ...}
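One way to read the trimmed hash above is that zero scores are simply dropped and a missing key is treated as 0.0 on lookup. The sketch below assumes that reading; `compact_dataset` and `score_for` are hypothetical names, not the plugin's.

```ruby
# dataset: { item_id => { user_id => score } }
# Returns a copy with every zero score removed.
def compact_dataset(dataset)
  dataset.transform_values do |scores|
    scores.reject { |_, score| score.zero? }
  end
end

# Looks up a score, treating a missing item or user as 0.0.
def score_for(dataset, item_id, user_id)
  dataset.fetch(item_id, {}).fetch(user_id, 0.0)
end
```

For sparse data, where most users have not rated most items, this removes the bulk of the entries without changing any lookup result.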
Problem 5
It was too slow
Solution
Re-write the slow bits in C
Details
• RubyInline
• Implemented Pearson
• Monkey patched original Ruby methods
• Very fast
Ruby Object
InlineC = Module.new do
  inline do |builder|
    builder.c '
      #include <math.h>
      #include "ruby.h"
      double c_sim_pearson(VALUE items) {
No Floats :(
InlineC = Module.new do
  inline do |builder|
    builder.c '
      #include <math.h>
      #include "ruby.h"
      double c_sim_pearson(VALUE items) {
Hash Lookup
if (!st_lookup(RHASH(prefs1)->tbl, items_a[i], &prefs1_item_ob)) {
  prefs1_item = 0.0;
} else {
  prefs1_item = NUM2DBL(prefs1_item_ob);
}
Conversion
return num / den;
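For comparison, a pure-Ruby Pearson correlation with the same num / den shape might look like this. It is a sketch, not the plugin's original Ruby method: the signature is assumed, and missing scores are treated as 0.0 to mirror the hash lookup shown in the C fragment.

```ruby
# prefs1, prefs2: { item_id => score }; items: the item ids to compare over.
# Returns a value between -1.0 and 1.0, or 0.0 when the correlation is undefined.
def sim_pearson(prefs1, prefs2, items)
  n = items.size
  return 0.0 if n.zero?

  xs = items.map { |i| prefs1.fetch(i, 0.0).to_f }
  ys = items.map { |i| prefs2.fetch(i, 0.0).to_f }

  sum1    = xs.sum
  sum2    = ys.sum
  sum1_sq = xs.sum { |x| x * x }
  sum2_sq = ys.sum { |y| y * y }
  p_sum   = xs.zip(ys).sum { |x, y| x * y }

  num = p_sum - (sum1 * sum2 / n)
  variance_product = (sum1_sq - sum1**2 / n) * (sum2_sq - sum2**2 / n)
  # A constant score vector has no variance; rounding can even push this
  # slightly below zero, so guard before taking the square root.
  return 0.0 if variance_product <= 0

  num / Math.sqrt(variance_product)
end
```

The per-item arithmetic in the middle of this method is the hot loop that the C version replaces.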
Design Decisions
• Not too many relationships
• Not too many ‘items’
• Similarity matrix for items, not users
Changing data
Scaling Even Further
• K Means clustering
• Split cluster by category
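K-means groups items into clusters so that similarities only need computing within a cluster. The sketch below is a generic, deterministic version of the algorithm and an assumption about how it might be applied here; the function names are illustrative, and the initial centroids are passed in rather than chosen at random.

```ruby
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Coordinate-wise mean of a non-empty list of points.
def mean_point(points)
  points.transpose.map { |coords| coords.sum.to_f / coords.size }
end

# points:    array of equal-length numeric vectors
# centroids: initial centroid guesses
# Returns [assignments, centroids], where assignments[i] is the index of the
# centroid nearest to points[i].
def k_means(points, centroids, max_iterations: 100)
  assignments = nil
  max_iterations.times do
    new_assignments = points.map do |point|
      centroids.each_index.min_by { |i| squared_distance(point, centroids[i]) }
    end
    break if new_assignments == assignments # converged
    assignments = new_assignments

    centroids = centroids.each_index.map do |i|
      members = points.each_index.select { |j| assignments[j] == i }.map { |j| points[j] }
      # An empty cluster keeps its previous centroid.
      members.empty? ? centroids[i] : mean_point(members)
    end
  end
  [assignments, centroids]
end
```

Each point could be one item's score vector, so similar items end up in the same cluster.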
Adding ratings
ActiveRecord::Schema.define(:version => 1) do
  create_table "books", :force => true do |t|
    t.string   "name"
    t.datetime "created_at"
    t.datetime "updated_at"
  end

  create_table "user_books", :force => true do |t|
    t.integer "user_id", :null => false
    t.integer "book_id", :null => false
    t.integer "rating",  :default => 0
  end

  create_table "users", :force => true do |t|
    t.string   "name"
    t.datetime "created_at"
    t.datetime "updated_at"
  end
end
class User < ActiveRecord::Base
  has_many :user_books
  has_many :books, :through => :user_books
  acts_as_recommendable :books, :through => :user_books, :score => :rating
end
That’s it
Improvements?
• Better API
• Perform calculations over a cluster (EC2) using Map/Nanite
class AARN < Nanite::Actor
  expose :sim_pearson

  def sim_pearson(item1, item2)
    Optimizations.c_sim_pearson(item1, item2)
  end
end
http://eribium.org/blog
twitter: maccman
email/jabber: [email protected]
Questions?
http://rubyurl.com/kUpk
http://github.com/maccman/acts_as_recommendable