Semantics-Based News Recommendation with SF-IDF+

International Conference on Web Intelligence, Mining, and Semantics (WIMS 2013)

June 13, 2013

Marnix Moerland (marnix.moerland@gmail.com)

Michel Capelle (michelcapelle@gmail.com)

Frederik Hogenboom (fhogenboom@ese.eur.nl)

Flavius Frasincar (frasincar@ese.eur.nl)

Erasmus University Rotterdam, PO Box 1738, NL-3000 DR Rotterdam, the Netherlands

Introduction (1)

• Recommender systems help users to plough through a massive and increasing amount of information

• Recommender systems:
– Content-based
– Collaborative filtering
– Hybrid

• Content-based systems are often term-based

• Common measure: Term Frequency – Inverse Document Frequency (TF-IDF) as proposed by Salton and Buckley [1988]

Introduction (2)

• One could take into account semantics:
– Semantic Similarity (SS) recommenders:
    • Jiang & Conrath [1997]
    • Leacock & Chodorow [1998]
    • Lin [1998]
    • Resnik [1995]
    • Wu & Palmer [1994]
– Concepts instead of terms → Concept Frequency – Inverse Document Frequency (CF-IDF):
    • Reduces noise caused by non-meaningful terms
    • Yields fewer terms to evaluate
    • Allows for semantic features, e.g., synonyms
    • Relies on a domain ontology
    • Published at WIMS 2011

Introduction (3)

• One could take into account semantics:
– Synsets instead of concepts → Synset Frequency – Inverse Document Frequency (SF-IDF):
    • Similar to CF-IDF
    • Does not rely on a domain ontology
    • Published at WIMS 2012

– Research has shown that relationships like synonymy, hyponymy, … provide structure and contribute to an improved level of interpretability

– Hence, we coin SF-IDF+, which additionally accounts for synset semantic relationships

Introduction (4)

• Implementations in Ceryx (as a plug-in for Hermes [Frasincar et al., 2009], a news processing framework)

• What is the performance of semantic recommenders?
– SF-IDF+ vs. SF-IDF
– SF-IDF+ vs. TF-IDF
– SF-IDF+ vs. SS

Framework: User Profile

• User profile consists of all read news items

• Implicit preference for specific topics

Framework: Preprocessing

• Before recommendations can be made, each news item is parsed:
– Tokenizer
– Sentence splitter
– Lemmatizer
– Part-of-Speech tagger

Framework: Synsets

• We make use of the WordNet dictionary and Word Sense Disambiguation (WSD)

• Each word has a set of senses and each sense has a set of semantically equivalent synonyms (synsets):
– Turkey:
    • turkey, Meleagris gallopavo (animal)
    • Turkey, Republic of Turkey (country)
    • joker, turkey (annoying person)
    • turkey, bomb, dud (failure)
– Fly:
    • fly, aviate, pilot (operate airplane)
    • flee, fly, take flight (run away)

• Synsets are linked using semantic pointers:
– Hypernym, hyponym, …
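The choice of one synset per word can be illustrated with a simplified Lesk-style disambiguation, which picks the sense whose gloss overlaps most with the surrounding words. This is only a minimal sketch: the sense identifiers and glosses below are hypothetical stand-ins rather than WordNet entries, and the function name `simplified_lesk` is an assumption, not part of Ceryx.

```python
# Toy sense inventory (hypothetical ids and glosses, NOT taken from WordNet).
SENSES = {
    "turkey": {
        "turkey.animal": "large bird raised on farms for its meat",
        "turkey.country": "republic in eurasia with capital ankara",
        "turkey.failure": "event or film that fails badly",
    }
}


def simplified_lesk(word, context, senses=SENSES):
    """Return the sense id whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses[word.lower()].items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


chosen = simplified_lesk("Turkey", "the republic moved its capital to Ankara")
print(chosen)
```

With real data the glosses would come from the dictionary itself, and ties between senses would need an explicit policy; here the first sense with the highest overlap wins.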

Framework: TF-IDF

• Term Frequency: the occurrence of a term t_i in a document d_j, i.e.,

$$tf_{i,j} = n_{i,j}$$

• Inverse Document Frequency: the occurrence of a term t_i in a set of documents D, i.e.,

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

• And hence

$$\text{tf-idf}_{i,j} = tf_{i,j} \times idf_i$$
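As a rough illustration of the TF-IDF weighting, the sketch below scores every term of every document in a small corpus. It assumes a raw occurrence count as term frequency and a natural-logarithm inverse document frequency; other variants (for example, length-normalized counts) are equally common, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)  # assumption: tf is the raw count n_ij
        scores.append({
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["stock", "market", "falls"],
    ["stock", "market", "rises"],
    ["turkey", "election", "results"],
]
weights = tf_idf(docs)
```

Terms that occur in fewer documents ("falls") receive a higher weight than terms shared across documents ("stock"), and a term present in every document would score zero.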

Framework: SF-IDF

• Synset Frequency: the occurrence of a synset s_i in a document d_j, i.e.,

$$sf_{i,j} = n_{i,j}$$

• Inverse Document Frequency: the occurrence of a synset s_i in a set of documents D, i.e.,

$$idf_i = \log \frac{|D|}{|\{j : s_i \in d_j\}|}$$

• And hence

$$\text{sf-idf}_{i,j} = sf_{i,j} \times idf_i$$
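The practical difference from TF-IDF is that counting happens over synsets, so synonyms collapse into one feature. The following sketch assumes word sense disambiguation has already happened and is represented by a hypothetical lookup table `WORD_TO_SYNSET`; the synset identifiers are made up for illustration, and the raw-count frequency is again an assumption.

```python
import math
from collections import Counter

# Hypothetical result of WSD: each known word is already mapped to a synset id.
WORD_TO_SYNSET = {
    "fly": "synset:operate_airplane",
    "aviate": "synset:operate_airplane",
    "pilot": "synset:operate_airplane",
    "flee": "synset:run_away",
    "turkey": "synset:turkey_country",
}


def to_synsets(tokens):
    """Replace each known token by its synset id; drop tokens without one."""
    return [WORD_TO_SYNSET[t] for t in tokens if t in WORD_TO_SYNSET]


def sf_idf(synset_docs):
    """Same weighting scheme as TF-IDF, applied to synset ids instead of terms."""
    n_docs = len(synset_docs)
    df = Counter(s for doc in synset_docs for s in set(doc))
    return [
        {s: c * math.log(n_docs / df[s]) for s, c in Counter(doc).items()}
        for doc in synset_docs
    ]


docs = [["pilot", "aviate", "turkey"], ["flee", "the", "scene"], ["fly", "home"]]
synset_docs = [to_synsets(d) for d in docs]
weights = sf_idf(synset_docs)
```

Here "pilot" and "aviate" count as two occurrences of the same synset in the first document, which a purely term-based count would treat as two unrelated features.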

Framework: SF-IDF+

• Synset Frequency: the occurrence of a synset s_i and its related synsets r_i in a document d_j, i.e.,

$$sf_{i,j} = n_{i,j}$$

• Inverse Document Frequency: the occurrence of synsets s_i and r_i in a set of documents D, i.e.,

$$idf_i = \log \frac{|D|}{|\{j : s_i, r_i \in d_j\}|}$$

• Weighting is applied depending on relations, and hence

$$\text{sf-idf+}_{i,j,r} = sf_{i,j} \times idf_i \times w_r$$
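One way to picture the extension is to expand each document with the synsets related to the ones it contains and to scale their scores by a relation weight. The sketch below makes several assumptions that the slides do not spell out: a related synset inherits the count of the synset that points to it, directly occurring synsets get weight 1, and the highest score is kept when a synset is reached in more than one way. The relation table `RELATED` and the weight values in `WEIGHTS` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical relation table and relation weights (illustrative values only).
RELATED = {
    "synset:turkey_bird": [("hypernym", "synset:bird")],
    "synset:eagle": [("hypernym", "synset:bird")],
}
WEIGHTS = {"hypernym": 0.5}


def expand(doc):
    """Return (synset, weight, count) triples for a document's synsets and their relatives."""
    triples = []
    for synset, count in Counter(doc).items():
        triples.append((synset, 1.0, count))  # the synset itself, weight 1
        for relation, other in RELATED.get(synset, []):
            # Assumption: the related synset inherits the source synset's count.
            triples.append((other, WEIGHTS[relation], count))
    return triples


def sf_idf_plus(synset_docs):
    expanded = [expand(doc) for doc in synset_docs]
    n_docs = len(expanded)
    # Document frequency over the expanded sets (synsets and related synsets).
    df = Counter(s for triples in expanded for s in {t[0] for t in triples})
    result = []
    for triples in expanded:
        scores = {}
        for synset, weight, count in triples:
            score = count * math.log(n_docs / df[synset]) * weight
            # Assumption: keep the highest score per synset.
            scores[synset] = max(score, scores.get(synset, 0.0))
        result.append(scores)
    return result


docs = [
    ["synset:turkey_bird", "synset:farm"],
    ["synset:eagle", "synset:mountain"],
    ["synset:election"],
]
weights = sf_idf_plus(docs)
```

The first two documents share no synset directly, yet both receive a score for the hypothetical "synset:bird", so a cosine comparison between them is no longer zero.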

Framework: SS (1)

• TF-IDF and SF-IDF(+) use cosine similarity:
– Two vectors:
    • User profile item scores
    • News message item scores
– Measures the cosine of the angle between the vectors

• Semantic Similarity (SS):
– Two vectors:
    • User profile synsets
    • News message synsets
– Jiang & Conrath [1997], Resnik [1995], and Lin [1998]: information content of synsets
– Leacock & Chodorow [1998] and Wu & Palmer [1994]: path length between synsets
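The cosine comparison used by TF-IDF and SF-IDF(+) can be sketched over sparse score vectors stored as dictionaries. The feature names and score values below are made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides prescribe.

```python
import math


def cosine_similarity(profile, item):
    """Cosine of the angle between two sparse vectors given as {feature: score} dicts."""
    dot = sum(score * item[f] for f, score in profile.items() if f in item)
    norm_profile = math.sqrt(sum(v * v for v in profile.values()))
    norm_item = math.sqrt(sum(v * v for v in item.values()))
    if norm_profile == 0 or norm_item == 0:
        return 0.0  # convention for empty or all-zero vectors
    return dot / (norm_profile * norm_item)


profile = {"stock": 0.4, "market": 0.3}
news_item = {"stock": 0.2, "election": 0.5}
similarity = cosine_similarity(profile, news_item)
```

Only features present in both vectors contribute to the dot product, so a news item that shares no terms or synsets with the profile scores zero.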

Framework: SS (2)

• The SS score is calculated by computing the pair-wise similarities between synsets in the unread document u and the user profile r:

[Equation: rank(u), computed from sim(u, r) over the synset pairs (u, r) in W]

where W is a vector with all combinations of synsets from r and u that have a common Part-of-Speech, and where sim(u, r) is any of the mentioned SS measures.
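The pair-wise scoring can be sketched with one of the path-based measures. The example uses the Wu & Palmer formula, 2·depth(lcs) / (depth(a) + depth(b)), on a tiny hypothetical hypernym tree with the root at depth 1. Averaging the pair similarities is an assumption of this sketch, as are the names `wu_palmer` and `ss_score`.

```python
# Hypothetical hypernym tree: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "bird": "animal",
    "turkey": "bird",
    "eagle": "bird",
    "dog": "animal",
}


def path_to_root(node):
    """Return the chain of nodes from `node` up to the root (node first)."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # The first node on a's path that is also an ancestor of b is the deepest shared one.
    lcs = next(n for n in path_a if n in ancestors_b)
    return 2 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))


def ss_score(unread, profile, sim=wu_palmer):
    """unread, profile: lists of (synset, pos) pairs. Averages sim over same-POS pairs."""
    pairs = [(u, r) for u, pos_u in unread for r, pos_r in profile if pos_u == pos_r]
    if not pairs:
        return 0.0
    # Assumption: the pair similarities are aggregated with a plain average.
    return sum(sim(u, r) for u, r in pairs) / len(pairs)


score = ss_score([("turkey", "n")], [("eagle", "n"), ("dog", "n")])
```

Any of the other measures could be passed in through the `sim` parameter, provided it accepts two synsets and returns a number.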

Implementation: Hermes

• The Hermes framework is used for building a news personalization service for RSS

• Its implementation is the Hermes News Portal (HNP):
– Programmed in Java
– Uses OWL / SPARQL / Jena / GATE / WordNet

Implementation: Ceryx

• Ceryx is a plug-in for HNP

• Uses WordNet / Stanford POS Tagger / JAWS lemmatizer / Lesk WSD

• Main focus is on recommendation support

• User profiles are constructed

• Computes TF-IDF, SF-IDF, SF-IDF+, and SS

Evaluation (1)

• Experiment:
– We let 19 participants evaluate 100 news items
– We use 8 different user profiles focusing on various topics
– Ceryx computes TF-IDF, SF-IDF, SF-IDF+, and SS for various cut-off values
– F1 scores are evaluated
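The evaluation step can be pictured as follows: every news item whose similarity score reaches the cut-off is recommended, and the recommendations are compared with the items a participant marked as interesting. This is a minimal sketch with made-up scores; treating "greater than or equal to the cut-off" as the recommendation rule is an assumption.

```python
def f1_at_cutoff(scores, relevant, cutoff):
    """F1 of recommending every item whose score is at least `cutoff`."""
    recommended = {item for item, score in scores.items() if score >= cutoff}
    true_positives = len(recommended & relevant)
    if true_positives == 0:
        return 0.0  # covers empty recommendations and zero overlap
    precision = true_positives / len(recommended)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical similarity scores and participant judgments.
scores = {"n1": 0.9, "n2": 0.6, "n3": 0.4, "n4": 0.1}
relevant = {"n1", "n3"}

for cutoff in (0.3, 0.5, 0.95):
    print(cutoff, f1_at_cutoff(scores, relevant, cutoff))
```

Sweeping the cut-off in this way shows the usual trade-off: a low value favors recall, a high value favors precision, and F1 balances the two.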

Evaluation (2)

• Results:

[Figure: results plot with series labeled TF-IDF and SF-IDF+]

Conclusions

• Recommendation is commonly performed using TF-IDF

• Semantics can be taken into account by considering synsets and their relations

• Semantics-based recommendation outperforms the classic term-based recommendation

• Future work:
– Also employ the similarity of words (e.g., named entities) missing from WordNet (e.g., based on the Google Distance)
– Compare SF-IDF, SF-IDF+, and SS with LDA (Latent Dirichlet Allocation) and ESA (Explicit Semantic Analysis)

Questions
