
Proceedings of Machine Learning Research 81:1–15, 2018 Conference on Fairness, Accountability, and Transparency

All The Cool Kids, How Do They Fit In? Popularity and Demographic Biases in Recommender Evaluation and Effectiveness∗†

Michael D. Ekstrand [email protected]

Mucun Tian [email protected]

Ion Madrazo Azpiazu [email protected]

Jennifer D. Ekstrand [email protected]

Oghenemaro Anuyah [email protected]

David McNeill [email protected]

Maria Soledad Pera [email protected]

People and Information Research Team, Dept. of Computer Science, Boise State University

Editors: Sorelle A. Friedler and Christo Wilson

Abstract

In the research literature, evaluations of recommender system effectiveness typically report results over a given data set, providing an aggregate measure of effectiveness over each instance (e.g. user) in the data set. Recent advances in information retrieval evaluation, however, demonstrate the importance of considering the distribution of effectiveness across diverse groups of varying sizes. For example, do users of different ages or genders obtain similar utility from the system, particularly if their group is a relatively small subset of the user base? We apply this consideration to recommender systems, using offline evaluation and a utility-based metric of recommendation effectiveness to explore whether different user demographic groups experience similar recommendation accuracy. We find demographic differences in measured recommender effectiveness across two data sets containing different types of feedback in different domains; these differences sometimes, but not always, correlate with the size of the user group in question. Demographic effects also have a complex, and likely detrimental, interaction with popularity bias, a known deficiency of recommender evaluation. These results demonstrate the need for recommender system evaluation protocols that explicitly quantify the degree to which the system is meeting the information needs of all its users, as well as the need for researchers and operators to move beyond naïve evaluations that favor the needs of larger subsets of the user population while ignoring smaller subsets.

∗ This paper can be reproduced with scripts available at https://dx.doi.org/10.18122/B2GM6F.
† This paper is an extension of the poster by Ekstrand and Pera (2017).

Keywords: recommender systems, fair evaluation

1. Introduction

Recommender systems are algorithmic tools for identifying items (e.g., products or services) of interest to users (Adomavicius and Tuzhilin, 2005; Ekstrand et al., 2010; Ricci et al., 2015). They are usually deployed to help mitigate information overload (Resnick et al., 1994). Internet-scale item spaces offer many more choices than humans can process, diminishing the quality of their decision-making abilities (Toffler, 1990; Gross, 1964). Recommender systems alleviate this problem by allowing users to more quickly focus on items likely to match their particular tastes. They are deployed across the modern Internet, suggesting products in e-commerce sites, movies and music in streaming media platforms, new connections on social networks, and many more types of items.

We are concerned with the fairness of recommender systems, a surprisingly tricky concept to define. In addition to the numerous types and operationalizations of fairness in the research literature, recommender fairness must identify which stakeholder groups to consider for fair treatment (Burke, 2017).

Both offline (Herlocker et al., 2004; Shani and Gunawardana, 2011) and online (Knijnenburg et al., 2012) evaluations of recommender systems typically focus on evaluating the system's effectiveness in aggregate over the entire population of users. While individual user characteristics are sometimes taken into account, as in demographic-informed recommendation (Pazzani, 1999; Ghazanfar and Prugel-Bennett, 2010), the end evaluation still aggregates over all users.

Recent developments in human-centered information retrieval have incorporated user demographics and characteristics to evaluate search engines and understand users' search behavior. Weber and Castillo (2010) use light user information augmented with census-based demographics to understand who is using a search engine. Mehrotra et al. (2017) follow this trend by measuring Bing's ability to satisfy the information needs of different subgroups of its user population, e.g. assessing whether it meets the needs of grandparents as effectively as those of young professionals.

This attention is necessary because the largest subgroup of users will tend to dominate overall statistics. If other subgroups have different needs, their satisfaction will carry less weight in the final analysis. This can lead to a misguided perception of the performance of the system and, more importantly, make it more difficult to identify how to better serve specific demographic groups.

Our fundamental research question is this: Do different demographic groups obtain different utility from the recommender system? This is a starting point for many further questions, such as whether particular demographic groups need to be better served by recommender systems and, if so, how they can be identified and supported in their information needs.

To address this question, we present an empirical analysis of the effectiveness of collaborative filtering recommendation strategies, stratified by the gender and age of the users in the data set. We apply widely-used recommendation techniques across two domains, musical artists and movies, using publicly-available data. We also explore the effect of rebalancing the data set by gender, the influence of user profile size on recommendation quality, and the interaction of demographic effects with previously documented biases in recommender evaluation, all in the context of demographically-distributed differences in effectiveness.

Our work is inspired by that of Mehrotra et al. (2017). We translate the concepts of their analysis from search engines to recommender systems. While our experiment is less sophisticated than Mehrotra et al.'s and necessarily limited by our offline experimental setting, it is fully reproducible using widely-distributed public data sets and can be easily adapted to additional algorithms, domains, and applications.

2. Background and Related Work

Recommender systems (Adomavicius and Tuzhilin, 2005; Ekstrand et al., 2010) are algorithmic tools for helping users find items that they may wish to purchase or consume. They have substantial influence; the best available public data indicates that recommendation drives 85% of Netflix video viewing (Gomez-Uribe and Hunt, 2015) and 30% of Amazon purchases (Linden et al., 2003).

2.1. Recommendation Techniques

There are a variety of families of recommendation algorithms. Collaborative filters (Ekstrand et al., 2010) mine user-item interaction traces, such as purchase records, click logs, or user-provided ratings of items, to generate recommendations based on the behavior of other users with similar taste. Content-based filters (Pazzani and Billsus, 2007; Lops et al., 2011) use item content or metadata, such as tags and text, to recommend items with similar content to items the user has liked in the past. Many production systems use a combination of these and other techniques as hybrid strategies to enhance the overall recommendation process (Burke, 2002; Bobadilla et al., 2013).

2.2. Recommender System Evaluation

Recommender systems are evaluated in offline settings using evaluation protocols derived from information retrieval (Herlocker et al., 2004; Gunawardana and Shani, 2009; Bellogin, 2012). These protocols hide a portion of the data and attempt to predict it using the recommendation model, measuring either the model's ability to predict withheld ratings (prediction accuracy evaluation) or its ability to recommend withheld items (top-N evaluation).

Top-N evaluation is widely regarded as the preferred setting, as it reflects the end goal of the recommender system (to recommend items the user will like) more accurately than predicting ratings. Offline top-N evaluation, however, has significant known problems. Among these are popularity bias (Bellogin et al., 2011), where the evaluation protocol gives higher accuracy scores to algorithms that favor popular items irrespective of their ability to meet user information needs, and misclassified decoys (Ekstrand and Mahant, 2017; Cremonesi et al., 2010), where a good recommendation is erroneously penalized because data on user preferences is incomplete.

Online evaluation, commonly using A/B tests (Kohavi et al., 2007) and measuring user response to recommendation, is the gold standard for effectiveness and avoids many of the problems of offline evaluation. User studies (Knijnenburg et al., 2012) allow even deeper insight into why users respond to recommendations in the way that they do. This type of study, however, is more expensive to conduct (in terms of time, protocols, and resources) than its offline counterpart (Shani and Gunawardana, 2011).

2.3. Fairness in Recommender Systems

The recommender system research community has long been interested in examining the social dimension of recommendation; the earliest modern recommender systems were developed in a human-computer interaction setting (Resnick et al., 1994; Hill et al., 1995), and there has been work on how they promote diversity or balkanization (van Alstyne and Brynjolfsson, 2005; Hosanagar et al., 2013).

More recent work has begun to consider questions of fairness in recommendation. Proposals for fair recommendation methods include penalizing algorithms for disparate distribution of prediction error (Yao and Huang, 2017), balancing neighborhoods before producing recommendations (Burke et al., 2017), and making recommended items independent from protected information (Kamishima and Akaho, 2017).

Burke (2017) taxonomizes fairness objectives and methods based on which set of stakeholders in the recommender system are being considered, as it is meaningful to consider fairness among many groups in a recommender system. In our work, we examine the C-fairness of recommender algorithms: whether or not they treat their users (consumers) fairly.

2.4. Demographic-Aware Evaluation

Demographic information has traditionally been considered as a means to improve the effectiveness of diverse tasks, from text classification (Hovy, 2015) to search (Weber and Castillo, 2010) and recommendation (Said et al., 2011). Unfortunately, little is known about the effects of demographic information when it comes to evaluation tasks (Langer and Beel, 2014).

Typical evaluations average over all users or data points, providing a simple aggregate measurement of the recommender's effectiveness. However, user satisfaction in a recommender system depends on more than accuracy (Herlocker et al., 2004; Langer and Beel, 2014). In fact, Mehrotra et al. (2017) demonstrate that this naïve approach of simply aggregating measurements masks important differences in how different groups of users experience the system. The system may be delivering high-quality service to one subset of its user group while another, smaller group of users receives lower-quality recommendations or search results; the overall metric will not reward effort that improves the experience of minority users as much as it rewards efforts that make things better for those already well-served.

The fundamental thrust of our present work is to translate this idea from the online web search setting employed by Mehrotra et al. to offline evaluation of recommender systems, and to examine whether applying existing algorithms to existing public data sets will provide comparable utility to different groups of users. The discussion presented in this paper expands the initial analysis presented by Ekstrand and Pera (2017).


3. Data and Methods

We used the LensKit recommender toolkit (Ekstrand et al., 2011) to build and evaluate several collaborative filtering algorithms across multiple public data sets with different types of feedback in multiple product domains.

3.1. Data Sets

While there are many public records of ratings, plays, and other common recommender inputs for use in research, few of them have the necessary user demographic information to assess bias in recommender effectiveness. We have found three that have the necessary data: early versions of the MovieLens data (Harper and Konstan, 2016) and the two Last.FM data sets collected by Celma (2010). Table 1 summarizes these data sets.

Table 1: Summary of data sets

Datasets    Users     Items     Pairs        Density
LFM1K       992       177,023   904,625      0.52%
LFM360K     359,347   160,168   17,559,443   0.03%
ML1M        6,040     3,706     1,000,209    4.47%

The LFM1K data set contains 19M records of 992 users playing songs from 177K artists, gathered from the Last.FM audioscrobbler. We aggregated this data at the artist level to produce play counts for 904K user-artist pairs. The LFM360K data set contains the top 50 most-played artists from 360K users along with their play counts, covering 160K artists. Both data sets contain gender, age, and sign-up date for many users (Figure 1 shows demographic coverage).
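The artist-level aggregation of LFM1K is conceptually simple; the sketch below shows one way to do it in pandas, assuming a play-event frame with `user`, `artist`, and `track` columns (the column names are ours, not those of the distributed files).

```python
import pandas as pd

def aggregate_artist_plays(events: pd.DataFrame) -> pd.DataFrame:
    """Collapse per-track play events into (user, artist) play counts."""
    return (events.groupby(['user', 'artist'], as_index=False)
                  .size()
                  .rename(columns={'size': 'plays'}))
```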

The ML1M data set contains 1M 5-star ratings of 3,900 movies by 6,040 users who joined MovieLens, a noncommercial movie recommendation service operated by the University of Minnesota, through the year 2000. Each user has a self-reported age, gender, occupation, and zip code. Some time after releasing the 1M data set, MovieLens stopped collecting demographic data from new users, so the larger recent data sets (10M and 20M) do not contain the data required for our experiment.

3.2. Source Data Distributions

Differences in recommender effectiveness need to be understood in the context of the demographic distribution of the underlying data.

Figure 1: User distribution by demographic group (age and gender, for LFM1K, LFM360K, and ML1M). Numbers in bars are the number of users in that bin.

Figure 2: Median items consumed by users in each demographic group. We omit LFM360K since it only contains each user's top 50 artists.

Figure 1 shows the distribution of each data set. All three data sets exhibit similar distributions of user genders, with the majority of users reporting as male; LFM1K is the least imbalanced. The largest block of ML1M users belongs to the 25-34 group, whereas a plurality of LFM360K users belong to the 18-24 group; most LFM1K users did not report their age. Approximately 10% of LFM360K users declined to share their gender, while close to 20% declined to share their age. All user records in the ML1M data set contain full demographic information. For consistency among the reported results, we bin Last.FM users into the same age groups used in the ML1M data set throughout.


Figure 2 shows user activity levels, as measured by the number of movies rated or artists played, in each user group. Men are more active than women in both data sets. The activity-age relationship in the ML1M data roughly follows the demographic distribution, with those groups that have more users also having more active users; the small number of users in most age brackets in LFM1K precludes drawing conclusions about age-activity relationships in that data.

3.3. Experimental Protocol

We partitioned each data set with 5-fold cross-validation. Our primary results use LensKit's default user-based sampling strategy: select 5 test sets of users, and for each user select 5 ratings to be the test ratings; the rest of those users' ratings, along with all ratings from users not in that test set, comprise the train set for that test set. For LFM360K, we sampled 5 disjoint sets of 5000 test users (or items) for each test set to decrease compute time. For LFM1K and ML1M, we partitioned the users into 5 disjoint sets.
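As a concrete illustration of the user-based strategy (not the LensKit code we actually ran), the following Python/pandas sketch partitions users into five disjoint folds and holds out five ratings per test user; the `user`, `item`, and `rating` column names are assumptions.

```python
import numpy as np
import pandas as pd

def user_based_folds(ratings: pd.DataFrame, n_folds: int = 5,
                     n_test: int = 5, seed: int = 42):
    """Yield (train, test) pairs: users form disjoint folds, and each test
    user contributes up to n_test held-out ratings."""
    rng = np.random.default_rng(seed)
    users = ratings['user'].unique()
    rng.shuffle(users)
    for fold_users in np.array_split(users, n_folds):
        fold = ratings[ratings['user'].isin(fold_users)]
        test = (fold.groupby('user', group_keys=False)
                    .apply(lambda df: df.sample(min(len(df), n_test),
                                                random_state=seed)))
        # everything not held out (including other folds' users) trains the model
        train = ratings.drop(test.index)
        yield train, test
```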

We also tested Bellogin's U1R method (Bellogin, 2012) for neutralizing popularity bias; this works exactly like the default, except it picks test sets of items instead of users, and it generates a different recommendation list for each user-item pair in the test data, with that item as the only test item to be found. The idea is that, by having the same number of test ratings for each item, recommenders that favor popular items can't win simply by having popular items be the right answer more often than unpopular ones.

3.4. Performance Metrics

We measure recommender effectiveness using Normalized Discounted Cumulative Gain (nDCG) (Jarvelin and Kekalainen, 2002), a widely-accepted measure of the effectiveness of a recommender system. nDCG measures the utility that a user is expected to obtain from a recommendation list, based on that user's estimated utility for individual items and the position in the list at which those items were presented. The nDCG for a recommendation list L generated for user u is computed with Equation 1:

$$\mathrm{nDCG}_{L,u} = \frac{\mathrm{DCG}_{L,u}}{\mathrm{IDCG}_{u}} \qquad (1)$$

$\mathrm{DCG}_{L,u}$ is defined by Equation 2, where $l_i$ is the $i$-th item in list $L$ and $\mu_u(l_i)$ is user $u$'s utility for item $l_i$, and $\mathrm{IDCG}_u$ is computed as $\mathrm{DCG}_{L,u}$ with a list consisting only of the user's rated items in non-increasing order of utility.

$$\mathrm{DCG}_{L,u} = \mu_u(l_1) + \sum_{i=2}^{|L|} \frac{\mu_u(l_i)}{\log_2 i} \qquad (2)$$

$\mathrm{nDCG}_{L,u}$ quantifies the utility achieved by a recommendation list as a fraction of the total achievable utility if the recommender could perfectly identify the user's most-preferred items. For the ML1M data set, we define $\mu_u(l_i)$ as the user's rating for movie $l_i$; for the Last.FM data sets, we use the number of times the user has played the artist. Items for which no data is available are assumed to have a utility of 0. Although this has significant conceptual problems (Ekstrand and Mahant, 2017), it is standard practice in recommender systems research, and there is no widely-accepted improvement.
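To make the metric concrete, here is a minimal nDCG sketch following Equations 1 and 2, including the zero-utility assumption for unrated items; the function names and the dict-based utility lookup are illustrative, not the LensKit implementation.

```python
import math

def dcg(utilities):
    """Equation 2: the first item is undiscounted; item i is discounted by log2(i)."""
    return sum(u if i == 1 else u / math.log2(i)
               for i, u in enumerate(utilities, start=1))

def ndcg(recommended, utility):
    """Equation 1: DCG of the recommended list divided by the ideal DCG.

    recommended -- ranked list of item ids
    utility     -- dict of item id -> rating or play count; missing items count as 0
    """
    gains = [utility.get(item, 0.0) for item in recommended]
    ideal = dcg(sorted(utility.values(), reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Example: a user who rated three movies, evaluated on a three-item list.
print(ndcg(['b', 'x', 'a'], {'a': 5, 'b': 4, 'c': 3}))
```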

3.5. Algorithms

We employed several classical and widely-used collaborative filtering algorithms, as implemented by LensKit. We operated each algorithm in both explicit (rating-based) and implicit (consumption record) feedback mode.

• Popular (Pop), recommending the most frequently-consumed items.

• Mean, recommending the items with the highest average rating.

• Item-Item (II), an item-based collaborative filter (Sarwar et al., 2001; Deshpande and Karypis, 2004) using 20 neighbors and cosine similarity. The explicit-feedback version normalizes ratings by subtracting item means; the implicit-feedback version replaces the weighted average with a simple sum of similarities.

• User-User (UU), a user-based collaborative filter (Resnick et al., 1994; Herlocker et al., 2002) configured to use 30 neighbors and cosine similarity. The explicit-feedback variant uses user-mean normalization for user rating vectors, and the implicit-feedback variant again replaces weighted averages with sums of similarities. User-user did not provide effective recommendations on the Last.FM data, so we exclude it from that data set's results.

• FunkSVD (MF), the popular gradient descent matrix factorization technique (Funk, 2006; Paterek, 2007) with 40 latent features and 150 training iterations per feature.

In the results, each algorithm is tagged with its variant. Algorithms suffixed with '-E' are explicit-feedback recommenders (applicable only to ML); '-B' are implicit-feedback recommenders that only consider whether an item was rated or played, irrespective of the number of plays (both data sets); and '-C' are implicit-feedback recommenders that use the number of times an artist was played as repeated feedback (LFM1K and LFM360K), log-normalized prior to recommendation.
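The log normalization used by the '-C' variants amounts to compressing heavy-tailed play counts before treating them as utilities; the sketch below shows one common form of that transform (the exact normalization LensKit applies is not reproduced here).

```python
import numpy as np

def log_normalize(play_counts):
    """Map raw play counts to a log scale so a few very high counts do not dominate."""
    return np.log1p(np.asarray(play_counts, dtype=float))
```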

The purpose of this work is not to compare algorithms, but to compare recommender performance across demographic groups. We have selected these algorithms to provide a representative sample of classical collaborative filtering approaches.

4. Results

Using the data and methods presented in Section 3, we discuss below the results of the experiments conducted to quantify user satisfaction with the presented recommendations among different demographic groups. To do so, we consider three different perspectives that guide our assessments: (i) analysis based on raw data, i.e., considering all users in the data sets, (ii) analysis based on user activity levels, i.e., controlled profile size, and (iii) analysis based on gender-balanced data sets.

4.1. Basic Results

To quantify the extent to which demographics affect the overall satisfaction obtained by users, we conducted an experiment that considers the performance of traditional recommendation algorithms for different gender and age groups, respectively.

Figure 3 illustrates the overall satisfaction obtained by each gender group, measured by nDCG, whereas Figure 4 does the same for users grouped by age.

Figure 3: Algorithm performance by gender. Highlighted cell is for the algorithm with the best overall performance on that data set.

For each data set's best-performing algorithm (highlighted), we compared the differences in utility for each demographic group. ML1M and LFM1K have statistically-significant differences between gender groups, and LFM360K has significant differences between age brackets (Kruskal-Wallis p < 0.01 with the Bonferroni correction for multiple comparisons).
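The statistical test can be reproduced along these lines: a SciPy sketch assuming a per-user results frame with a metric column (e.g. `ndcg`) and a group column (e.g. `gender`), with the Bonferroni correction applied by dividing the significance threshold by the number of comparisons. The column names and function are ours.

```python
import pandas as pd
from scipy import stats

def group_difference_test(per_user: pd.DataFrame, metric: str, group: str,
                          n_comparisons: int, alpha: float = 0.01):
    """Kruskal-Wallis test across demographic groups, Bonferroni-adjusted."""
    samples = [g[metric].to_numpy() for _, g in per_user.groupby(group)]
    stat, p = stats.kruskal(*samples)
    return stat, p, bool(p < alpha / n_comparisons)
```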

4.2. Controlling for Profile Size

As seen in Figure 2, different demographic groups have different activity levels as measured by the number of items they have rated or consumed. The size of a user's profile can be a factor in their recommendation utility, given that more items provide a stronger basis for recommendation. To control for the effect of profile size on user satisfaction, we fitted linear models predicting the nDCG using the number of items in the user's profile (excluding LFM360K, since it only contains each user's top 50 artists). We used the average nDCG achieved by all algorithms for a particular user as the dependent variable, so we are only predicting a single metric per user; this captures an overall notion of the 'difficulty' of producing effective recommendations for that user.

Figure 4: Algorithm performance by age. Highlighted cell has highest overall accuracy. We omit LFM1K because most users in that data set lack age data.

Figure 5 shows the fitted models; we apply a log transform to the item count and take the square root of the nDCG to achieve a better fit. Surprisingly, there is a negative relationship between user profile size and recommendation accuracy; the exact cause is unknown, but we suspect that users with more items in their profile have already rated the 'easy' items, so recommending for them is a harder problem.

Figure 5: Models predicting nDCG with profile size.

Figure 6 shows the nDCG for each group after removing the effect of user profile size. We see that the demographic effects observed in Section 4.1 remain after this control, indicating a demographic effect of training the models on the data beyond that explained by user profile size.

Figure 6: Recommendation utility after controlling for profile size. (a) Corrected utility by gender. (b) MovieLens corrected utility by age.

4.3. Resampling for Balance

As shown in Figure 1, both the ML1M and LFM360K data sets include a larger proportion of male users, unbalancing the training data. As preprocessing data to produce fair training data is one way to train fair models (Kamiran and Calders, 2009), we resampled the ML1M and LFM360K data sets to produce gender-balanced versions of each and re-trained the algorithms.

We balanced the data sets by identifying users with known gender information and randomly sampling, without replacement, the same number of female and male users (1500 samples each for the ML1M data set and 75000 samples each for the LFM360K data set).
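A minimal version of this resampling step, assuming a pandas user table with a `gender` column coded `'F'`/`'M'`; the per-group sample size is a parameter, matching the 1500 and 75000 figures above.

```python
import pandas as pd

def gender_balanced_sample(users: pd.DataFrame, n_per_group: int,
                           seed: int = 42) -> pd.DataFrame:
    """Sample the same number of female and male users without replacement."""
    known = users[users['gender'].isin(['F', 'M'])]
    return (known.groupby('gender', group_keys=False)
                 .apply(lambda g: g.sample(n=n_per_group, random_state=seed)))

# The balanced training data is then the ratings restricted to the sampled users, e.g.:
# balanced = ratings[ratings['user'].isin(gender_balanced_sample(users, 1500)['user'])]
```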

Figure 7 shows the experiment results on the gender-balanced data sets, and Table 2 shows the numeric change from the unbalanced experiment for the best-performing algorithm on each data set. We repeated the Kruskal-Wallis test on both sampled ML1M and LFM360K data sets, and it did not find a statistically significant difference between groups on either data set. Resampling the data, while reducing recommender accuracy slightly, did not create new gender differences in performance for LFM360K, and seems to have reduced the difference for ML1M. We are not sure that it went entirely away, as the Kruskal-Wallis test may be overly conservative and does not test directly for the elimination of an effect, but it does seem to have diminished. Resampling so that each group has the same number of ratings may eliminate the difference.

Figure 7: Algorithm accuracy by gender on balanced data sets. Highlighted cell is for the algorithm with the best overall performance on that data set.

4.4. Reducing Popularity Bias

To reduce the effect of popularity bias, we ran the ML1M version of the experiment using Bellogin's U1R protocol, as described in Section 3.3. Since this protocol partitions items instead of users, different users may have different numbers of test items, and the distribution of user demographics may differ from the underlying data. Figure 8a shows the distribution of test pairs per user, and Figure 8b shows the demographic distribution of the users in the test data. This distribution corresponds well to the underlying user distribution.

Table 2: Changes in nDCG observed on balanced vs. raw data on the ML1M and LFM360K data sets

Dataset    Algorithm  Gender  nDCG   nDCG (balanced data)  Relative difference
ML1M       UU-B       Female  0.337  0.334                 1.03%
ML1M       UU-B       Male    0.351  0.344                 2.22%
ML1M       UU-B       Any     0.347  0.339                 2.54%
LFM360K    II-CS      Female  0.293  0.296                 1.11%
LFM360K    II-CS      Male    0.301  0.298                 0.76%
LFM360K    II-CS      Any     0.297  0.297                 0.06%

Figure 8: Distribution of test data in the ML1M U1R experiment. (a) Distribution of test items per user. (b) Gender distribution of test pairs and test users.

Figure 9 shows accuracy by demographic group for the best algorithm for each data set under the U1R protocol. The differences on gender are consistent with the basic results in Figure 3. We compared two averaging strategies, averaging across all user-item pairs by user gender and averaging each user's recommendation results prior to averaging all users with a particular gender, and saw no difference.
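The two averaging strategies correspond to micro- and macro-averaging over the per-pair results; a sketch with an assumed per-pair frame containing `user` and `ndcg` columns (our names), applied within each gender group.

```python
import pandas as pd

def micro_average(per_pair: pd.DataFrame, metric: str = 'ndcg') -> float:
    """Average the metric over all user-item test pairs directly."""
    return per_pair[metric].mean()

def macro_average(per_pair: pd.DataFrame, metric: str = 'ndcg') -> float:
    """Average each user's pairs first, then average the per-user means."""
    return per_pair.groupby('user')[metric].mean().mean()
```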

Age tells a different story: on LFM360K, we see a different pattern in the distribution of accuracy across ages than we do under the user-based evaluation protocol in Figure 4. It is not clear which provides a more accurate picture, but this does demonstrate that correcting for one effect (popularity bias) can change the results for another effect (demographic bias).

Figure 9: Recommender effectiveness under the U1R protocol. The best overall algorithm for each data set is shown. (a) Effectiveness by gender. (b) Effectiveness by age.

5. Discussion and Limitations

Having observed some differences in recommender performance between demographic groups, we now turn to the implications of our results and some of their limitations.

5.1. Implications for Recommender Evaluation

The existence of differences in measured recommender performance between demographic groups indicates a need to consider who is obtaining how much benefit from a recommender system. If some users are underserved by the recommender, it may be indicative of an area for improvement, particularly if that group of users represents a market segment in which the recommender operator would like to expand their business.

Research and production evaluation of recommender systems needs to account for how different subsets of the user population should be weighted. There is not necessarily a one-size-fits-all answer to the question of how to structure an evaluation; it is a decision that needs to be made based on the values and goals of the business or research program. Our methods and results can provide data to understand the ramifications of the decisions made about recommender evaluation.

5.2. Interaction with Popularity Bias

Popularity bias (Bellogin et al., 2011) describes the phenomenon in which offline top-N recommender evaluation gives higher scores to algorithms that favor popular items. The extent to which this is a defect in the evaluation (favoring popularity irrespective of user preference) versus an actual measurement of the effectiveness of popular recommendations is unclear; it is believed that it represents a significant deviation from 'true' performance, but the degree of that deviation is difficult to quantify.

From first principles, we expect popularity bias to exacerbate demographic biases: the patterns of the largest group of users will dominate the list of most-popular items, so favoring popular recommendations will also favor recommendations that are more likely to match the taste of the dominant group of users at the expense of other groups with different favorite items.

However, our empirical results do not demonstrate that effect in the data we have. Some of the demographic differences in recommender accuracy that we see, such as the ML1M gender difference, correlate with the size of the user group; others, such as the LFM1K gender differences and LFM360K age differences, do not.

It is difficult to generalize about the causes of the differences we have seen from only three data sets, but it is clear that we need to look beyond popularity bias and demographic group size to understand the drivers of demographic differences in recommender performance. The consistency of the results across algorithm families, however, suggests some robustness to these effects.

Further, we have observed that applying one technique for reducing popularity bias can shift our measurements of demographic bias. This indicates tradeoffs in the measurement of different biases, so that applying the popularity bias reduction method is not a clearly correct decision.

5.3. User Retention

One of the goals of recommender systems is to engage users with the systems themselves, so that over time users can benefit from improved personalization as explicit preference data becomes available.

Figure 10: Retention rate for each demographic group on the ML1M data set (with 95% Wilson confidence intervals).

A way to quantify this engagement is through retention: do users continue using the system? The ML1M data set includes timestamps for each rating, allowing us to analyze user activity over time; we use this to measure retention and examine its relationship to demographic group. We divide user rating activity into sessions by considering the user to be starting a new session whenever there is a gap of at least an hour between two ratings (Halfaker et al., 2014). Figure 10 shows the retention rate (the percentage of users who returned for a second session) for each demographic group.
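The sessionization and retention measure can be sketched as follows, assuming a ratings frame with `user` and `timestamp` (seconds) columns; the Wilson confidence intervals shown in Figure 10 are omitted for brevity, and the column names are assumptions.

```python
import pandas as pd

SESSION_GAP = 3600  # a gap of at least one hour starts a new session

def user_returned(ratings: pd.DataFrame) -> pd.Series:
    """For each user, True if they came back for at least a second rating session."""
    df = ratings.sort_values(['user', 'timestamp'])
    gaps = df.groupby('user')['timestamp'].diff()
    new_session = gaps.isna() | (gaps >= SESSION_GAP)  # first rating or >= 1h gap
    sessions = new_session.groupby(df['user']).cumsum()
    return sessions.groupby(df['user']).max() >= 2

# Retention rate by demographic group, given a user table indexed by user id:
# returned = user_returned(ratings)
# retention_by_gender = returned.groupby(users.loc[returned.index, 'gender']).mean()
```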

We observe that men have a higher retention rate than women (p < 0.005); in the ML1M data set, the algorithms we tested provide more accurate recommendations to men than women. While this by no means demonstrates a causal link (for one thing, we are not testing the same algorithm and implementation that MovieLens employed when these users were active), it suggests room for further exploration. The link between recommendation quality and user retention is key to the online testing employed by large-scale recommender system operators such as Netflix.

5.4. Limitations of Data

Our analysis on the ML1M data set was conducted with users' explicit feedback, the provided ratings. While this data set shows that a certain demographic group dominates its counterparts in providing ratings in the system, it does not account for implicit feedback, or the behavior of users who watched movies without necessarily providing ratings for them. To ensure that we accounted for the differences in how demographic groups prefer to provide feedback, we also performed an analysis on Last.FM based on the number of times a song was played. Our results show consistency across the different groups irrespective of the type of user feedback, i.e., implicit or explicit.

While our results highlight the need to consider disparate demographic groups when evaluating recommender systems to better account for user satisfaction, the users of MovieLens and Last.FM may not be representative of general recommender system users. Both of these systems (particularly at the time the Last.FM data was collected) appeal to experienced users who care deeply about their movies and music. Casual users are more likely to use services such as Netflix and Spotify, and may exhibit markedly different behavior and experience different recommender utility than the expert users in the data sets we examined. Unfortunately, data from more widely-used systems with sufficient attributes to look for demographic effects is difficult to find. Many widely-used data sets, such as Amazon.com and Netflix, do not contain user demographics.

5.5. Limitations of Evaluation Protocol

The fact that our results are in an entirely offline experimental setting also introduces limitations. Our data cannot distinguish whether the differences in measured performance are due to actual differences in the recommender's ability to meet users' information needs, or differences in the evaluation protocol's effectiveness at measuring that ability. While we suspect that they do reflect actual differences in recommender utility, additional study with online evaluation is needed to complement and calibrate these results, as the correlation between offline accuracy and online measures of effectiveness is often weak (Rossetti et al., 2016).

A similar concern can be raised for online protocols (Mehrotra et al., 2017), but the closer connection between online measures and long-term customer value and experience improves their external validity. However, even if our observed differences are due in significant part to limitations of the evaluation protocol, the result is still interesting: biases in the evaluation protocol for or against groups of users would impede the development of fair recommender systems. Even institutions that can carry out online evaluations use offline protocols to pre-screen algorithms prior to live deployment, and offline evaluation metrics are the basis for the objective functions in many recommender model-training processes.

5.6. Limitations of Algorithm Selection

While we have tested representatives of several key families of collaborative filtering algorithms, there are many types of algorithms that we have not considered. Two notable omissions are content-based filters, which we omitted because only one of our data sets has sufficient data to support them, and learning-to-rank recommenders, which LensKit does not yet provide.

Our evaluation methodology and open experimental scripts make it easy to re-run our analyses on additional algorithms as they become available in the underlying software.

5.7. Choice of Metric

There are many widely-used metrics that can be used to evaluate recommender systems (Gunawardana and Shani, 2009). For clarity and space, we focus our results in Section 4 on nDCG, because it considers all of a user's test items and has a good conceptual mapping to recommendation utility. We included several metrics in our experimental runs, however, and they showed similar result trends.

Figures 11 and 12 show our key results from Section 4.1 with the Mean Reciprocal Rank (MRR) metric (Kantor and Voorhees, 1997). MRR measures recommendation effectiveness by taking the reciprocal of the position of the first relevant suggestion in each user's ranked recommendation list and averaging this value over all users in the data set.

Figure 11: Algorithm performance based on Mean Reciprocal Rank for users grouped by gender.

These performance trends match those in Figures 3 and 4: (i) male users gain better utility from the varied recommendation strategies than female users in ML1M and LFM360K, (ii) female users gain better utility on the LFM1K data set, and (iii) there are age differences that do not map to demographic group size. The difference in recommender system satisfaction among users of different genders is more prominent when measured by MRR than by nDCG. We hypothesize that this is because MRR penalizes recommendations that move the first relevant item (i.e., highly rated items) further down the list more heavily than nDCG does, especially in long lists. nDCG, on the other hand, considers the position of all relevant recommendations, along with their value to the user, instead of only the position of the first relevant item. Which metric is a better measurement of usefulness depends on the precise recommendation task.
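For comparison with the nDCG sketch in Section 3.4, a minimal MRR computation; treating an item as relevant when it appears in the user's held-out profile is our simplification, and the function names are illustrative.

```python
def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item in the list, or 0 if none appears."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rec_lists, relevant_by_user):
    """Average the reciprocal rank over all users in the data set."""
    return sum(reciprocal_rank(rec_lists[u], relevant_by_user[u])
               for u in rec_lists) / len(rec_lists)
```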

5.8. Ethical Considerations

As our work is entirely based on widely-distributed public data and we did not perform any data linking that might expose or deanonymize users in the underlying data sets, it does not place MovieLens or Last.FM users at any risk to which they have not already been exposed for years by the publication of these data sets.

Figure 12: Algorithm performance based on Mean Reciprocal Rank for users grouped by age.

6. Conclusion and Future Work

We set out to consider whether recommender systems produced equal utility for users of different demographic groups. Using publicly available data sets, we compared the utility, as measured with nDCG, for users grouped by age and gender.

Regardless of the recommender strategy considered, we found significant differences in nDCG among demographic groups. Selecting the best algorithm from the families we tested, the ML1M and LFM1K data sets showed statistically significant differences in effectiveness between gender groups, while the LFM360K data set highlighted a significant effect based on user age.

The demographic effect remains when controlling for the amount of training data available for a user; it is diminished, but may not entirely disappear, when resampling the underlying data to train the recommender on a gender-balanced data set.

Notably, the effects in utility did not exclusively benefit large groups: we observed higher accuracy for women on the Last.FM data, despite the lower representation of female users in the respective Last.FM data sets.

6.1. Future Work for Research

While our analysis focused on whether this effect could be found across a variety of common recommendation algorithms, the differences appear to vary from algorithm to algorithm. This suggests there is room for considering how different algorithms respond to evaluation and what characteristics contribute to more uniform utility. The analysis can also be expanded to include more families of algorithms, such as content-based recommendation and learning-to-rank techniques.

Having found this effect with age and gender, we have not yet considered intersectionality: how does recommender effectiveness vary with the interaction of multiple demographic attributes? Our analysis did not find that smaller groups were always disadvantaged, so more research should be done to understand why groups are unevenly advantaged by recommendation algorithms.

There is also room for this analysis to be repeated across other item domains. How recommender system utility compares across demographics may be especially interesting for domains like real estate, housing, and job recommendations, areas with well-documented historical discrimination.

6.2. Future Work for Industry

Given the hazards of publishing data sets which include individual users' demographic information, there are limits to the advances academia can pursue in this work. Just as it falls to industry to consider whether their own recommender systems provide comparable utility across demographics, so too does the responsibility for publishing those results. We see the work of Mehrotra et al. (2017) as an exemplary start in this direction, although we would like to see additional details provided to ease replication of the results for other system operators.

6.3. Towards Fair Recommendation

Research on the fairness of recommender systems is just getting started, and there are many important questions to explore. We have focused on one small corner of the problem: the equity of recommender utility as experienced by different groups of users. As Burke (2017) shows, there are many more dimensions to the problem, such as the equitable treatment of content producers, as well as the distribution of non-accuracy recommendation value like diversity and serendipity.

Acknowledgments

This work was partially supported by NSF grant IIS 15-65936.

References

G Adomavicius and A Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE TKDE, 17(6):734–749, 2005. ISSN 1041-4347. doi: 10.1109/TKDE.2005.99. URL http://dx.doi.org/10.1109/TKDE.2005.99.

Alejandro Bellogin. Performance prediction and evaluation in Recommender Systems: an Information Retrieval perspective. PhD thesis, UAM, 2012.

Alejandro Bellogin, Pablo Castells, and Ivan Cantador. Precision-oriented evaluation of recommender systems: an algorithmic comparison. In Proc. of ACM RecSys, pages 333–336. ACM, 2011. ISBN 9781450306836. doi: 10.1145/2043932.2043996. URL http://doi.acm.org/10.1145/2043932.2043996.

Jesus Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutierrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.

Robin Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.

Robin Burke. Multisided fairness for recommendation. Computing Research Repository, July 2017. URL http://arxiv.org/abs/1707.00093.

Robin Burke, Nasim Sonboli, Masoud Mansoury, and Aldo Ordonez-Gauger. Balanced neighborhoods for fairness-aware collaborative recommendation. In FATREC Workshop on Fairness, Accountability and Transparency in Recommender Systems at RecSys, 2017. URL http://scholarworks.boisestate.edu/fatrec/2017/1/3/.

O. Celma. Music Recommendation and Discovery in the Long Tail. Springer, 2010.

Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. Performance of recommender algorithms on top-N recommendation tasks. In Proc. of ACM RecSys, pages 39–46. ACM, 2010. ISBN 9781605589060. doi: 10.1145/1864708.1864721. URL http://doi.acm.org/10.1145/1864708.1864721.

Mukund Deshpande and George Karypis. Item-based top-N recommendation algorithms. ACM TOIS, 22(1):143–177, 2004.

Michael Ekstrand, John Riedl, and Joseph A Konstan. Collaborative filtering recommender systems. Foundations and Trends in Human-Computer Interaction, 4(2):81–173, 2010. ISSN 1551-3955. doi: 10.1561/1100000009. URL http://dx.doi.org/10.1561/1100000009.

Michael D Ekstrand and Vaibhav Mahant. Sturgeon and the cool kids: Problems with top-N recommender evaluation. In Proc. of FLAIRS. AAAI Press, 22 May 2017. URL https://md.ekstrandom.net/research/pubs/sturgeon/.

Michael D. Ekstrand and Maria Soledad Pera. The demographics of cool: Popularity and recommender performance for different groups of users. In RecSys 2017 Poster Proceedings, 2017.

Michael D Ekstrand, Michael Ludwig, Joseph A Konstan, and John T Riedl. Rethinking the recommender research ecosystem: reproducibility, openness, and LensKit. In Proc. of ACM RecSys, 2011.

Simon Funk. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, December 2006. Accessed: 2010-4-8.

Mustansar Ali Ghazanfar and Adam Prugel-Bennett. A scalable, accurate hybrid recommender system. In Proc. of IEEE WKDD, pages 94–98, 2010.

Carlos A Gomez-Uribe and Neil Hunt. The Netflix recommender system: Algorithms, business value, and innovation. ACM TMIS, 6(4):13:1–13:19, December 2015. doi: 10.1145/2843948.

Bertram Myron Gross. The Managing of Organizations: The Administrative Struggle, volume 2. Free Press of Glencoe, New York, 1964.

Asela Gunawardana and Guy Shani. A survey of accuracy evaluation metrics of recommendation tasks. Journal of Machine Learning Research, 10:2935–2962, December 2009. ISSN 1532-4435. URL http://jmlr.org/papers/v10/gunawardana09a.html.

Aaron Halfaker, Oliver Keyes, Daniel Kluver, Jacob Thebault-Spieker, Tien Nguyen, Kenneth Shores, Anuradha Uduwage, and Morten Warncke-Wang. User session identification based on strong regularities in inter-activity time. arXiv:1411.2878 [cs], November 2014. URL http://arxiv.org/abs/1411.2878.

F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. Transactions on Interactive Intelligent Systems, 5(4):19, 2016.

Jon Herlocker, Joseph A Konstan, and John Riedl. An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Information Retrieval, 5(4):287–310, 2002.

Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John T Riedl. Evaluating collaborative filtering recommender systems. ACM TOIS, 22(1):5–53, 2004.

William Hill, Larry Stead, Mark Rosenstein, and George Furnas. Recommending and evaluating choices in a virtual community of use. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 194–201, 1995. doi: 10.1145/223904.223929. URL http://dx.doi.org/10.1145/223904.223929.

Kartik Hosanagar, Daniel Fleder, Dokyun Lee, and Andreas Buja. Will the global village fracture into tribes? Recommender systems and their effects on consumer fragmentation. Management Science, 60(4):805–823, November 2013. doi: 10.1287/mnsc.2013.1808.

Dirk Hovy. Demographic factors improve classification performance. In ACL (1), pages 752–762, 2015.

Kalervo Jarvelin and Jaana Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM TOIS, 20(4):422–446, October 2002. doi: 10.1145/582415.582418.

F Kamiran and T Calders. Classifying without discriminating. In Proc. of 2nd International Conference on Computer, Control and Communication, pages 1–6, February 2009. doi: 10.1109/IC4.2009.4909197.

Toshihiro Kamishima and Shotaro Akaho. Considerations on recommendation independence for a find-good-items task. In Proc. of Workshop on Fairness, Accountability and Transparency in Recommender Systems at RecSys, 2017. URL http://scholarworks.boisestate.edu/fatrec/2017/1/11/.

Paul B Kantor and Ellen Voorhees. Report on the TREC-5 confusion track. In The Fifth Text REtrieval Conference (TREC-5), October 1997. URL http://trec.nist.gov/pubs/trec5/t5_proceedings.html.

Bart P Knijnenburg, Martijn C Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):441–504, 2012.

Ron Kohavi, Randal M Henne, and Dan Sommerfield. Practical guide to controlled experiments on the web: listen to your customers not to the HiPPO. In Proc. of ACM KDD, pages 959–967, 2007. ISBN 9781595936097. doi: 10.1145/1281192.1281295. URL http://portal.acm.org/citation.cfm?doid=1281192.1281295.

Stefan Langer and Joeran Beel. The comparability of recommender system evaluations and characteristics of Docear's users. In Proc. of Workshop on Recommender Systems Evaluation: Dimensions and Design (REDD) at ACM RecSys, pages 1–6, 2014.

G Linden, B Smith, and J York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003. doi: 10.1109/MIC.2003.1167344.

Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook, pages 73–105. Springer, 2011.

Rishabh Mehrotra, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. Auditing search engines for differential satisfaction across demographics. In Proc. of WWW Companion, 2017. ISBN 9781450349147. doi: 10.1145/3041021.3054197. URL https://doi.org/10.1145/3041021.3054197.

Arkadiusz Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proc. of KDD Cup and Workshop, volume 2007, pages 5–8, 2007.

Michael J Pazzani. A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 13(5-6):393–408, 1999.

Michael J Pazzani and Daniel Billsus. Content-based recommendation systems. In The Adaptive Web, pages 325–341. Springer, 2007.

Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. GroupLens: an open architecture for collaborative filtering of netnews. In Proc. of ACM CSCW, pages 175–186, 1994.

Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B Kantor. Recommender Systems Handbook. Springer, 2015.

Marco Rossetti, Fabio Stella, and Markus Zanker. Contrasting offline and online results when evaluating recommendation algorithms. In Proc. of ACM RecSys, pages 31–34, 2016. doi: 10.1145/2959100.2959176.

Alan Said, Till Plumbaum, Ernesto W De Luca, and Sahin Albayrak. A comparison of how demographic data affects recommendation. UMAP, page 7, 2011.

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of WWW, pages 285–295. ACM, 2001.

Guy Shani and Asela Gunawardana. Evaluating recommendation systems. In Recommender Systems Handbook, pages 257–297. Springer, 2011.

Alvin Toffler. Future Shock. Bantam, 1990.

Marshall van Alstyne and Erik Brynjolfsson. Global village or cyber-balkans? Modeling and measuring the integration of electronic communities. Management Science, 51(6):851–868, June 2005. doi: 10.1287/mnsc.1050.0363.

Ingmar Weber and Carlos Castillo. The demographics of web search. In Proc. of ACM SIGIR, pages 523–530. ACM, 2010.

Sirui Yao and Bert Huang. Beyond parity: Fairness objectives for collaborative filtering. Computing Research Repository, May 2017. URL http://arxiv.org/abs/1705.08804.
