Page 1:

PAT: Preference-Aware Transfer Learning for Recommendation with Heterogeneous Feedback

Feng Liang (a, b, c), Wei Dai (a, b, c), Yunfeng Huang (a, b, c), Weike Pan (a, b, c, *), Zhong Ming (a, b, c, *)

{liangfeng2018, daiwei20171, huangyunfeng2017}@email.szu.edu.cn, {panweike, mingz}@szu.edu.cn

a) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China

b) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen, China

c) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Page 2:

Introduction

Problem Definition

Rating Prediction with Users’ Heterogeneous Explicit Feedback

Input: A set of grade score records R = {(u, i, r_ui)} with r_ui ∈ G a grade score such as {0.5, 1, ..., 5}, and a set of binary rating records R̃ = {(u, i, r̃_ui)} with r̃_ui ∈ B = {like, dislike}.

Goal: Estimate the grade score of each (user, item) pair in the test data TE.

Page 3:

Introduction

Motivation

1. TMF exploits the implicit preference context of users from the auxiliary binary data for grade score prediction, but the implicit preference context of users within the target data is not exploited.

2. SVD++ and MF-MPC exploit only the preference context in the target data to model users’ personalized preferences, and do not consider the auxiliary binary ratings as TMF does.

Page 4:

Introduction

Our Contributions

1. In order to share knowledge between the two different types of data more fully, we address the problem from a transfer learning perspective, i.e., we take the grade scores as the target data and the binary ratings as the auxiliary data.

2. Besides the observed explicit feedback of grade scores and binary ratings, we propose to exploit the implicit preference context beneath the feedback, which is incorporated into the prediction of users’ grade scores for items.

3. We conduct extensive empirical studies on two large and public datasets and find that our PAT performs significantly better than the state-of-the-art methods.

Page 5:

Introduction

Notations (1/3)

Table: Some notations and their explanations.

Notation                            Explanation
n                                   number of users
m                                   number of items
u, u′ ∈ {1, 2, ..., n}              user ID
i, j ∈ {1, 2, ..., m}               item ID
G = {0.5, 1, ..., 5}                grade score range
B = {like, dislike}                 binary rating range
r_ui ∈ G                            grade score of user u to item i
r̃_ui ∈ B                            binary rating of user u to item i
R = {(u, i, r_ui)}                  grade score records (training data)
R̃ = {(u, i, r̃_ui)}                  binary rating records (training data)
p = |R|                             number of grade scores (training data)
p̃ = |R̃|                             number of binary ratings (training data)
I^g_u, g ∈ G                        items rated by user u with score g (training data)
P_u                                 items liked (w/ positive feedback) by user u (training data)
N_u                                 items disliked (w/ negative feedback) by user u (training data)
TE = {(u, i, r_ui)}                 grade score records in test data

Page 6:

Introduction

Notations (2/3)

Table: Some notations and their explanations (cont.).

Notation                                          Explanation
µ ∈ R                                             global average rating value
b_u ∈ R                                           user bias
b_i ∈ R                                           item bias
d                                                 number of latent dimensions
U_u·, W_u· ∈ R^{1×d}                              user-specific latent feature vector
U, W ∈ R^{n×d}                                    user-specific latent feature matrix
V_i·, C^p_j·, C^n_j·, C^o_i′·, C^g_i′· ∈ R^{1×d}  item-specific latent feature vector
V, C^p, C^n, C^o, C^g ∈ R^{m×d}                   item-specific latent feature matrix
r̂_ui                                              predicted grade score of user u to item i
r̂̃_ui                                              predicted binary rating of user u to item i

Page 7:

Introduction

Notations (3/3)

Table: Some notations and their explanations (cont.).

Notation                            Explanation
γ                                   learning rate
ρ                                   interaction weight between grade scores and binary ratings
α                                   tradeoff parameter on the corresponding regularization terms
δ_O, δ_G, δ_p, δ_n ∈ {0, 1}         indicator variables for the preference-context components
w_p, w_n                            weights on positive and negative feedback
T                                   iteration number in the algorithm

Page 8:

Related Work

Related Work (1/2)

Probabilistic matrix factorization (PMF) [Salakhutdinov and Mnih, 2008] is a dominant recommendation model that takes the explicit grade score matrix as input and outputs the learned low-rank feature vectors of users and items.

Transfer by collective factorization (TCF) [Pan and Yang, 2013] models users’ personalized preferences from both grade scores and binary ratings collectively by sharing users’ features and items’ features. Notice that when the auxiliary binary ratings are not considered, TCF reduces to PMF.

Interaction-rich transfer by collective factorization (iTCF) [Pan and Ming, 2014] is built on CMF [Singh and Gordon, 2008] and exploits the rich interactions among the user-specific latent features of the target data and the auxiliary data when calculating the gradients of the item features in the model training stage. Notice that when these rich interactions are not exploited, iTCF reduces to CMF.

Transfer by mixed factorization (TMF) [Pan et al., 2016] combines the feature vectors learned from the two different types of data in a collective and integrative manner. Notice that when the like/dislike feedback of users to items is not considered in grade score prediction, TMF becomes iTCF.

Page 9:

Related Work

Related Work (2/2)

In SVD++ [Koren, 2008], a user’s estimated score for an item is related to the other items that the user rated in the past, which are called the preference context of the user. These rated items are not distinguished from one another: whatever scores they are assigned, they fall into the same set, i.e., their effects are classified into a single class, which is a typical example of one-class preference context (OPC). When predicting an unobserved score, the introduction of OPC provides a global preference context for the user.

In MF-MPC [Pan and Ming, 2017], on the other hand, the rated items of a given user except the target one, i.e., the preference context, are classified into several clusters according to their grade scores, which is named multi-class preference context (MPC). Intuitively, MPC is an advanced version of OPC which not only offers the global preference information of users, but also distinguishes the information associated with different rating values.

Page 10:

Method

Collective Matrix Factorization (CMF)

In order to jointly model the two different types of explicit feedback, i.e., r_ui and r̃_ui, a state-of-the-art method is proposed to approximate the grade score and the binary rating simultaneously by sharing some latent variables [Singh and Gordon, 2008],

$$\begin{cases} \hat{r}_{ui} = U_{u\cdot} V_{i\cdot}^{T} + b_u + b_i + \mu \\ \hat{\tilde{r}}_{ui} = W_{u\cdot} V_{i\cdot}^{T} \end{cases} \quad (1)$$

where the item-specific latent feature vector V_i· is shared between the two factorization subtasks. However, for the goal of grade score prediction, some implicit preference contexts are missing in the joint modeling approach of CMF [Singh and Gordon, 2008].
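To make the shared-parameter structure of Eq. (1) concrete, below is a minimal NumPy sketch; the function names and array layout (U, W as n×d arrays, V as m×d, bias vectors b_u and b_i, scalar mu) are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def predict_grade_cmf(u, i, U, V, b_u, b_i, mu):
    """Grade-score branch of Eq. (1): r_hat_ui = U_u V_i^T + b_u + b_i + mu."""
    return U[u] @ V[i] + b_u[u] + b_i[i] + mu

def predict_binary_cmf(u, i, W, V):
    """Binary-rating branch of Eq. (1): the item factor V_i is shared with the grade-score branch."""
    return W[u] @ V[i]
```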

Page 11:

Method

Implicit Preference Context

Mathematically, we may represent the one-class preference context as C^O_u·, the graded preference context as C^G_u·, the positive preference context as C^p_u·, and the negative preference context as C^n_u· as follows [Koren, 2008, Pan and Ming, 2017, Pan et al., 2016],

$$C^{O}_{u\cdot} = \delta_{O}\,\frac{1}{\sqrt{|I_{u}\setminus\{i\}|}} \sum_{i' \in I_{u}\setminus\{i\}} C^{o}_{i'\cdot} \quad (2)$$

$$C^{G}_{u\cdot} = \delta_{G} \sum_{g \in G} \frac{1}{\sqrt{|I^{g}_{u}\setminus\{i\}|}} \sum_{i' \in I^{g}_{u}\setminus\{i\}} C^{g}_{i'\cdot} \quad (3)$$

$$C^{p}_{u\cdot} = \delta_{p}\, w_{p}\,\frac{1}{\sqrt{|P_{u}|}} \sum_{j \in P_{u}} C^{p}_{j\cdot} \quad (4)$$

$$C^{n}_{u\cdot} = \delta_{n}\, w_{n}\,\frac{1}{\sqrt{|N_{u}|}} \sum_{j \in N_{u}} C^{n}_{j\cdot} \quad (5)$$

where δ_O, δ_G, δ_p, δ_n ∈ {0, 1} are the indicator variables, and w_p and w_n are the weights on positive feedback and negative feedback, respectively.

Page 12:

Method

Transfer with Implicit Preference Context

With the preference context, we propose to incorporate them into the collective factorization framework,

$$\begin{cases} \hat{r}_{ui} = U_{u\cdot} V_{i\cdot}^{T} + b_u + b_i + \mu + (C^{O}_{u\cdot} + C^{G}_{u\cdot} + C^{p}_{u\cdot} + C^{n}_{u\cdot})\, V_{i\cdot}^{T} \\ \hat{\tilde{r}}_{ui} = W_{u\cdot} V_{i\cdot}^{T} \end{cases} \quad (6)$$

which will bring the user-specific latent feature vectors of two users u and u′ close if they have similar implicit preference contexts, in a similar way to that of SVD++ [Koren, 2008]. Notice that we incorporate the preference context into the prediction rule of grade scores rather than that of binary ratings, because this matches our final goal of grade score prediction rather than binary rating prediction.
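Under the same illustrative NumPy layout as above, the PAT prediction rule in Eq. (6) simply adds the summed context vectors to the user factor before the dot product with V_i; this is a hedged sketch rather than the authors' code.

```python
def predict_grade_pat(u, i, U, V, b_u, b_i, mu, context):
    """Grade-score branch of Eq. (6); `context` is the tuple returned by preference_context(...)."""
    C_O_u, C_G_u, C_p_u, C_n_u = context
    return (U[u] + C_O_u + C_G_u + C_p_u + C_n_u) @ V[i] + b_u[u] + b_i[i] + mu

def predict_binary_pat(u, i, W, V):
    """Binary-rating branch of Eq. (6): unchanged from CMF, W_u V_i^T."""
    return W[u] @ V[i]
```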

Page 13:

Method

Overall Prediction Rule of PAT

The following methods are special cases of our PAT:

RSVD [Koren, 2008]: {e1, e2}

CMF [Singh and Gordon, 2008]: {e1, e2, e3, e4}

iTCF [Pan and Ming, 2014]: {e1, e2, e3, e4, e5}

TMF [Pan et al., 2016]: {e1, e2, e3, e4, e5, e6, e7}

SVD++ [Koren, 2008], MF-MPC [Pan and Ming, 2017]: {e1, e2, e8}

Page 14:

Method

Objective Function

We then reach an objective function similar to that of CMF [Singh and Gordon, 2008], iTCF [Pan and Ming, 2014] and TMF [Pan et al., 2016],

$$\min_{\Theta} \sum_{u=1}^{n} \sum_{i=1}^{m} y_{ui} f_{ui} + \lambda \sum_{u=1}^{n} \sum_{i=1}^{m} \tilde{y}_{ui} \tilde{f}_{ui} \quad (7)$$

where y_ui and ỹ_ui indicate whether (u, i) is observed in R and R̃, respectively,

$$f_{ui} = \frac{1}{2}(r_{ui} - \hat{r}_{ui})^{2} + \frac{\alpha}{2}\|U_{u\cdot}\|^{2} + \frac{\alpha}{2}\|V_{i\cdot}\|^{2} + \frac{\alpha}{2}\|b_{u}\|^{2} + \frac{\alpha}{2}\|b_{i}\|^{2} + \delta_{p}\frac{\alpha}{2}\sum_{j\in P_{u}}\|C^{p}_{j\cdot}\|_{F}^{2} + \delta_{n}\frac{\alpha}{2}\sum_{j\in N_{u}}\|C^{n}_{j\cdot}\|_{F}^{2} + \delta_{O}\frac{\alpha}{2}\sum_{i'\in I_{u}\setminus\{i\}}\|C^{o}_{i'\cdot}\|_{F}^{2} + \delta_{G}\frac{\alpha}{2}\sum_{g\in G}\sum_{i'\in I^{g}_{u}\setminus\{i\}}\|C^{g}_{i'\cdot}\|_{F}^{2},$$

and

$$\tilde{f}_{ui} = \frac{1}{2}(\tilde{r}_{ui} - \hat{\tilde{r}}_{ui})^{2} + \frac{\alpha}{2}\|W_{u\cdot}\|^{2} + \frac{\alpha}{2}\|V_{i\cdot}\|^{2}.$$
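A minimal sketch of the two per-record losses in Eq. (7), assuming like/dislike is encoded numerically (e.g., 1/0) for the auxiliary branch; `context_rows` is an assumed helper argument that collects the C^p, C^n, C^o and C^g rows active for the given (u, i) record.

```python
def f_target(r_ui, r_hat_ui, U_u, V_i, b_u, b_i, context_rows, alpha=0.01):
    """f_ui: half squared error plus L2 regularization on the parameters touched by this record."""
    reg = U_u @ U_u + V_i @ V_i + b_u ** 2 + b_i ** 2 + sum(c @ c for c in context_rows)
    return 0.5 * (r_ui - r_hat_ui) ** 2 + 0.5 * alpha * reg

def f_auxiliary(r_tilde_ui, r_tilde_hat_ui, W_u, V_i, alpha=0.01):
    """f~_ui: half squared error on the binary rating plus L2 regularization on W_u and V_i."""
    return 0.5 * (r_tilde_ui - r_tilde_hat_ui) ** 2 + 0.5 * alpha * (W_u @ W_u + V_i @ V_i)
```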

Page 15:

Method

Gradients (1/2)

We have the gradients of the model parameters w.r.t. f_ui as follows,

$$\nabla\mu = -e_{ui}$$
$$\nabla b_{u} = -e_{ui} + \alpha b_{u}$$
$$\nabla b_{i} = -e_{ui} + \alpha b_{i}$$
$$\nabla U_{u\cdot} = -e_{ui} V_{i\cdot} + \alpha U_{u\cdot}$$
$$\nabla V_{i\cdot} = -e_{ui}\,(\rho U_{u\cdot} + (1-\rho) W_{u\cdot} + C^{p}_{u\cdot} + C^{n}_{u\cdot} + C^{O}_{u\cdot} + C^{G}_{u\cdot}) + \alpha V_{i\cdot}$$
$$\nabla C^{o}_{i'\cdot} = \delta_{O}\Big(-e_{ui}\,\frac{1}{\sqrt{|I_{u}\setminus\{i\}|}}\, V_{i\cdot} + \alpha C^{o}_{i'\cdot}\Big), \quad i' \in I_{u}\setminus\{i\}$$
$$\nabla C^{g}_{i'\cdot} = \delta_{G}\Big(-e_{ui}\,\frac{1}{\sqrt{|I^{g}_{u}\setminus\{i\}|}}\, V_{i\cdot} + \alpha C^{g}_{i'\cdot}\Big), \quad i' \in I^{g}_{u}\setminus\{i\},\; g \in G$$

where e_ui = (r_ui − r̂_ui) is the error w.r.t. the target grade score, and ρU_u· + (1−ρ)W_u· is used to introduce rich interactions [Pan and Ming, 2014] between the user-specific latent features U_u· and W_u·.
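The sketch below implements these gradients for one target grade-score record; the gradients of the individual context rows C^o, C^g, C^p, C^n follow the same "-e_ui times the record's normalizer times V_i, plus alpha times the row" pattern and are only indicated in a comment. The function name and argument layout are assumptions.

```python
def gradients_target(u, i, r_ui, U, W, V, b_u, b_i, mu, context, rho=0.5, alpha=0.01):
    """Gradients w.r.t. f_ui for one observed grade score (u, i, r_ui)."""
    C_O_u, C_G_u, C_p_u, C_n_u = context
    e_ui = r_ui - ((U[u] + C_O_u + C_G_u + C_p_u + C_n_u) @ V[i] + b_u[u] + b_i[i] + mu)

    grad_mu = -e_ui
    grad_bu = -e_ui + alpha * b_u[u]
    grad_bi = -e_ui + alpha * b_i[i]
    grad_Uu = -e_ui * V[i] + alpha * U[u]
    grad_Vi = (-e_ui * (rho * U[u] + (1 - rho) * W[u] + C_p_u + C_n_u + C_O_u + C_G_u)
               + alpha * V[i])
    # Each active context row C^o_{i'}, C^g_{i'}, C^p_j, C^n_j gets
    #   -e_ui * (its normalizer, e.g. w_p / sqrt(|P_u|)) * V[i] + alpha * (that row).
    return e_ui, grad_mu, grad_bu, grad_bi, grad_Uu, grad_Vi
```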

Page 16:

Method

Gradients (2/2)

We have the gradients of the remaining model parameters as follows,

$$\nabla C^{p}_{j\cdot} = \delta_{p}\Big(-e_{ui}\, w_{p}\,\frac{1}{\sqrt{|P_{u}|}}\, V_{i\cdot} + \alpha C^{p}_{j\cdot}\Big), \quad j \in P_{u}$$
$$\nabla C^{n}_{j\cdot} = \delta_{n}\Big(-e_{ui}\, w_{n}\,\frac{1}{\sqrt{|N_{u}|}}\, V_{i\cdot} + \alpha C^{n}_{j\cdot}\Big), \quad j \in N_{u}$$
$$\nabla W_{u\cdot} = \lambda\,(-\tilde{e}_{ui} V_{i\cdot} + \alpha W_{u\cdot})$$
$$\nabla V_{i\cdot} = \lambda\,(-\tilde{e}_{ui}\,(\rho W_{u\cdot} + (1-\rho) U_{u\cdot}) + \alpha V_{i\cdot})$$

where ẽ_ui = (r̃_ui − r̂̃_ui) is the error w.r.t. the auxiliary binary rating; the first two gradients are taken w.r.t. f_ui of a target record, and the last two w.r.t. f̃_ui of an auxiliary record.
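Similarly, here is a short sketch of one SGD step for an auxiliary binary-rating record, with like/dislike assumed to be encoded as 1/0 and all names illustrative.

```python
def sgd_step_auxiliary(u, i, r_tilde_ui, U, W, V, lam=1.0, rho=0.5, alpha=0.01, gamma=0.01):
    """One SGD step on f~_ui for an observed binary rating (u, i, r~_ui)."""
    e_tilde = r_tilde_ui - W[u] @ V[i]
    grad_Wu = lam * (-e_tilde * V[i] + alpha * W[u])
    # rich interaction: the item gradient mixes W_u and U_u with weight rho
    grad_Vi = lam * (-e_tilde * (rho * W[u] + (1 - rho) * U[u]) + alpha * V[i])
    W[u] -= gamma * grad_Wu
    V[i] -= gamma * grad_Vi
```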

Page 17:

Method

Update Rule

We have the update rule,

$$\theta = \theta - \gamma \nabla\theta, \quad (8)$$

where γ is the learning rate, and θ can be µ, b_u, b_i, U_u·, V_i·, C^p_j·, C^n_j·, C^o_i′·, C^g_i′· or W_u·.

Page 18:

Method

Algorithm

Algorithm 1 The algorithm of preference-aware transfer (PAT).

1: for t = 1, ..., T do
2:   for iter = 1, ..., |R ∪ R̃| do
3:     Randomly pick a record (u, i, r_ui) or (u, i, r̃_ui) from R ∪ R̃.
4:     Calculate the gradients w.r.t. f_ui or f̃_ui accordingly.
5:     Update the corresponding model parameters.
6:   end for
7:   Decrease the learning rate via γ ← γ × 0.9.
8: end for
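A minimal Python sketch of Algorithm 1, assuming `step_target` and `step_auxiliary` compute and apply the parameter updates of the two previous slides for a single record; the per-iteration random picks of step 3 are approximated here by shuffling the pooled records once per outer iteration.

```python
import random

def train_pat(R, R_tilde, params, step_target, step_auxiliary, T=50, gamma=0.01):
    """Stochastic gradient descent over the union of grade-score and binary-rating records."""
    pool = [("grade", rec) for rec in R] + [("binary", rec) for rec in R_tilde]
    for t in range(T):
        random.shuffle(pool)                       # steps 2-3: visit records in random order
        for kind, (u, i, r) in pool:
            if kind == "grade":
                step_target(u, i, r, params, gamma)     # steps 4-5 for a grade score
            else:
                step_auxiliary(u, i, r, params, gamma)  # steps 4-5 for a binary rating
        gamma *= 0.9                               # step 7: decay the learning rate
    return params
```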

Page 19:

Method

Discussion

Our transfer learning solution is very generic and contains several state-of-the-art factorization-based recommendation methods as special cases, including RSVD [Koren, 2008], CMF [Singh and Gordon, 2008], iTCF [Pan and Ming, 2014], TMF [Pan et al., 2016], SVD++ [Koren, 2008] and MF-MPC [Pan and Ming, 2017].

From Table 4, we can see that our PAT contains several pluggable components, such as the components for the auxiliary binary ratings, the positive and negative preference context, the multiclass preference context, and the interaction between the two types of feedback, which shows that our solution is very generic and flexible.

Table: Relationships between our preference-aware transfer (PAT) and other factorization-based methods, viewed as projections of the graphical model of our PAT shown in Fig. 13.

Algorithm                                            Edges
RSVD [Koren, 2008]                                   {e1, e2}
CMF [Singh and Gordon, 2008]                         {e1, e2, e3, e4}
iTCF [Pan and Ming, 2014]                            {e1, e2, e3, e4, e5}
TMF [Pan et al., 2016]                               {e1, e2, e3, e4, e5, e6, e7}
SVD++ [Koren, 2008], MF-MPC [Pan and Ming, 2017]     {e1, e2, e8}
PAT-OPC, PAT (proposed)                              {e1, e2, e3, e4, e5, e6, e7, e8}

Page 20:

Experiments

Datasets

We adopt two public datasets used in a previous study about modeling heterogeneous feedback [Pan et al., 2016], i.e., Movielens 10M (denoted as ML10M) and Flixter. The ML10M dataset contains 10,000,054 grade scores from 71,567 users to 10,681 items. The Flixter dataset contains 8,196,075 grade scores from 147,612 users to 48,794 items.

To simulate the problem setting with heterogeneous feedback, we process each dataset as follows: (i) we randomly split the data into five parts of similar size; and (ii) we then take two parts as training data with grade scores, take another two parts as binary ratings by transforming grade scores larger than or equal to four to “like” and grade scores less than four to “dislike”, and take the remaining part as the test data with grade scores. We repeat this process five times and obtain five copies of grade score records, binary ratings and test data. The results are averaged over these five copies of data.
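The five-way split described above can be sketched as follows; this is a hedged illustration in which `records` is assumed to be a list of (user, item, grade) triples.

```python
import numpy as np

def simulate_heterogeneous_feedback(records, seed=0):
    """Split grade-score records into target grade scores, auxiliary binary ratings and test data."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(records)), 5)

    target = [records[k] for k in np.concatenate(parts[:2])]        # two parts: grade scores
    auxiliary = []                                                  # two parts: binarized ratings
    for k in np.concatenate(parts[2:4]):
        u, i, r = records[k]
        auxiliary.append((u, i, "like" if r >= 4 else "dislike"))
    test = [records[k] for k in parts[4]]                           # one part: test grade scores
    return target, auxiliary, test
```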

Page 21:

Experiments

Baselines

RSVD [Koren, 2008] is a basic matrix factorization method without modeling preference context or auxiliary binary ratings, which is a special case of our PAT with edges {e1, e2};

MF-MPC [Pan and Ming, 2017] is a recent advanced matrix factorization method exploiting the multiclass preference context beneath the grade scores, which is a special case of our PAT with edges {e1, e2, e8}; and

TMF [Pan et al., 2016] is a recent factorization-based transfer learning method incorporating the auxiliary binary ratings, which is a special case of our PAT with edges {e1, e2, e3, e4, e5, e6, e7}.

Notice that we do not include some other algorithms for the studied problem, such as collective matrix factorization (CMF) [Singh and Gordon, 2008] with edges {e1, e2, e3, e4} and interaction-rich transfer by collective factorization (iTCF) [Pan and Ming, 2014] with edges {e1, e2, e3, e4, e5}, because they usually perform worse than TMF [Pan et al., 2016].

Page 22:

Experiments

Parameter Configurations

We adhere to the same rules used in TMF [Pan et al., 2016]. Specifically, we fix the number of latent dimensions d = 20 on ML10M and d = 10 on Flixter, respectively, the iteration number T = 50, the learning rate γ = 0.01, the interaction weight ρ = 0.5, the weight on the auxiliary binary ratings λ = 1, the tradeoff parameter on the regularization terms α = 0.01, and the weights on positive and negative feedback w_p = 2 and w_n = 1.
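For reference, the fixed hyperparameters above can be collected in one place; the dictionary keys are only a convenient naming, not part of the paper.

```python
PAT_CONFIG = {
    "d": {"ML10M": 20, "Flixter": 10},  # number of latent dimensions
    "T": 50,                            # iteration number
    "gamma": 0.01,                      # learning rate
    "rho": 0.5,                         # interaction weight
    "lambda": 1.0,                      # weight on the auxiliary binary ratings
    "alpha": 0.01,                      # regularization tradeoff parameter
    "w_p": 2.0,                         # weight on positive feedback
    "w_n": 1.0,                         # weight on negative feedback
}
```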

Page 23:

Experiments

Evaluation Metrics

Mean Absolute Error (MAE)

$$\mathrm{MAE} = \sum_{(u,i,r_{ui}) \in TE} |r_{ui} - \hat{r}_{ui}| \,/\, |TE|$$

Root Mean Square Error (RMSE)

$$\mathrm{RMSE} = \sqrt{\sum_{(u,i,r_{ui}) \in TE} (r_{ui} - \hat{r}_{ui})^{2} \,/\, |TE|}$$

Performance: the smaller the better.
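Both metrics can be computed with a few lines of NumPy; `predict(u, i)` stands in for any of the models' prediction rules (an illustrative sketch).

```python
import numpy as np

def mae_rmse(test_records, predict):
    """MAE and RMSE over the test set TE of (u, i, r_ui) triples; smaller is better."""
    errors = np.array([r_ui - predict(u, i) for (u, i, r_ui) in test_records])
    return np.abs(errors).mean(), np.sqrt((errors ** 2).mean())
```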

Page 24:

Experiments

Main Results (1/4)

Table: Recommendation performance of our preference-aware transfer (PAT) and other factorization-based methods on ML10M and Flixter, where the results of RSVD [Koren, 2008] and TMF [Pan et al., 2016] are copied from [Pan et al., 2016]. Notice that we follow the parameter setting in TMF [Pan et al., 2016] and fix α = 0.01 and T = 50 for all the methods, and w_p = 2 and w_n = 1 for TMF and our PAT. We also include the configurations in our generic PAT framework for comparative study and reproducibility.

Data     Algorithm   MAE               RMSE              Configurations
ML10M    RSVD        0.6438 ± 0.0011   0.8364 ± 0.0012   δ_p = δ_n = 0, δ_G = 0, λ = 0, ρ = 1
ML10M    MF-MPC      0.6162 ± 0.0006   0.8063 ± 0.0007   δ_p = δ_n = 0, δ_G = 1, λ = 0, ρ = 1
ML10M    TMF         0.6124 ± 0.0007   0.8005 ± 0.0008   δ_p = δ_n = 1, δ_G = 0, λ = 1, ρ = 0.5
ML10M    PAT         0.6107 ± 0.0003   0.7989 ± 0.0008   δ_p = δ_n = 1, δ_G = 1, λ = 1, ρ = 0.5
Flixter  RSVD        0.6561 ± 0.0007   0.8814 ± 0.0010   δ_p = δ_n = 0, δ_G = 0, λ = 0, ρ = 1
Flixter  MF-MPC      0.6383 ± 0.0004   0.8644 ± 0.0005   δ_p = δ_n = 0, δ_G = 1, λ = 0, ρ = 1
Flixter  TMF         0.6348 ± 0.0007   0.8615 ± 0.0012   δ_p = δ_n = 1, δ_G = 0, λ = 1, ρ = 0.5
Flixter  PAT         0.6332 ± 0.0006   0.8572 ± 0.0010   δ_p = δ_n = 1, δ_G = 1, λ = 1, ρ = 0.5

Page 25:

Experiments

Main Results (2/4)

Observations:

Our PAT performs significantly better than all the baseline methods across the two datasets, which shows the effectiveness of our transfer learning solution in modeling users’ heterogeneous feedback and preference context.

Compared with RSVD and MF-MPC, TMF and our PAT with both target grade scores and auxiliary binary ratings perform better, which showcases the usefulness of the binary ratings.

Page 26:

Experiments

Main Results (3/4)

[Figure: MAE and RMSE bar charts for SVD++, MF-MPC, PAT-OPC and PAT.]

Figure: Recommendation performance of factorization methods with one-class preference context (OPC) and multiclass preference context (MPC), i.e., MF with OPC (SVD++ [Koren, 2008]), MF with MPC (MF-MPC [Pan and Ming, 2017]), the reduced version of our PAT with OPC (i.e., PAT-OPC) and our PAT with MPC (PAT) on ML10M (top) and Flixter (bottom), respectively.

Page 27:

Experiments

Main Results (4/4)

The overall performance ordering is SVD++ < MF-MPC < PAT-OPC < PAT, which clearly showcases the effectiveness of our preference-aware transfer learning solution in modeling users’ heterogeneous feedback.

For the two methods with OPC, i.e., SVD++ and our PAT-OPC, and the two methods with MPC, i.e., MF-MPC and our PAT, we can see that integrating the binary rating records always brings performance improvement.

...

Page 28:

Conclusions and Future Work

Conclusions

In particular, we take the grade scores as the target data and the likes/dislikes as the auxiliary data in a transfer learning view, and exploit the implicit preference beneath the target data and the auxiliary data as the preference context, in order to build a more accurate and generic recommendation model.

Technically, we find that several recent algorithms can be projected to parts of our generic solution PAT as special cases.

Empirically, we obtain very promising results on two large and public datasets in comparison with several state-of-the-art methods.

More importantly, we observe that the empirical results are consistent with the technical framework with different subsets of components, i.e., more components lead to better performance.

Page 29:

Conclusions and Future Work

Future Work

We are interested in further generalizing our generic factorization framework with deep federated learning [Xue et al., 2019, Yang et al., 2019] and ranking-oriented recommendation [Wu et al., 2018, Pei et al., 2019].

Page 30:

Thank you

Thank you!

We thank the support of the National Natural Science Foundation of China Nos. 61872249, 61836005 and 61672358.

Q & A: If you have any questions and/or suggestions, you are welcome to send us an email: [email protected].

Page 31:

References

Koren, Y. (2008). Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434.

Pan, W. and Ming, Z. (2014). Interaction-rich transfer learning for collaborative filtering with heterogeneous user feedbacks. IEEE Intelligent Systems, 29(6):48–54.

Pan, W. and Ming, Z. (2017). Collaborative recommendation with multiclass preference context. IEEE Intelligent Systems, 32(2):45–51.

Pan, W., Xia, S., Liu, Z., Peng, X., and Ming, Z. (2016). Mixed factorization for collaborative recommendation with heterogeneous explicit feedbacks. Information Sciences, 332:84–93.

Pan, W. and Yang, Q. (2013). Transfer learning in heterogeneous collaborative filtering domains. Artificial Intelligence, 197:39–55.

Pei, C., Zhang, Y., Zhang, Y., Sun, F., Lin, X., Sun, H., Wu, J., Jiang, P., Ge, J., Ou, W., and Pei, D. (2019). Personalized re-ranking for recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 3–11.

Salakhutdinov, R. and Mnih, A. (2008). Probabilistic matrix factorization. In Annual Conference on Neural Information Processing Systems, pages 1257–1264.

Singh, A. P. and Gordon, G. J. (2008). Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 650–658.

Page 32:

References

Wu, L., Hsieh, C., and Sharpnack, J. (2018). SQL-Rank: A listwise approach to collaborative ranking. In Proceedings of the 35th International Conference on Machine Learning, pages 5311–5320.

Xue, F., He, X., Wang, X., Xu, J., Liu, K., and Hong, R. (2019). Deep item-based collaborative filtering for top-N recommendation. ACM Transactions on Information Systems, 37(3):33:1–33:25.

Yang, Q., Liu, Y., Chen, T., and Tong, Y. (2019). Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology, 10(2):12:1–12:19.
