Applying Semantic Analyses to Content-based Recommendation and Document Clustering
Eric Rozell, MRC Intern, Rensselaer Polytechnic Institute
Bio
• Graduate Student @ Rensselaer Polytechnic Institute
• Research Assistant @ Tetherless World Constellation
• Student Fellow @ Federation of Earth Science Informatics Partners
• Research Advisor: Peter Fox
• Research Focus: Semantic eScience
• Contact: [email protected]
Outline
• Background
• Semantic Analysis
  – Probase Conceptualization
  – Explicit Semantic Analysis
  – Latent Dirichlet Allocation
• Recommendation Experiment
  – Recommendation Systems
  – Experiment Setup
  – Results
• Clustering Experiment
  – Problem
  – K-Means
  – Results
• Conclusions
Background
• Billions of documents on the Web
• Semi-structured data from Web 2.0 (e.g., tags, microformats)
• Most knowledge remains in unstructured text
• Many natural language techniques for:
  – Ontology extraction
  – Topic extraction
  – Named entity recognition/disambiguation
• Some techniques are better than others for various information retrieval tasks…
Probase
• Developed at Microsoft Research Asia
• Probabilistic knowledge base built from the Bing index and query logs (and other sources)
• Text mining patterns
  – Namely, Hearst patterns: “… artists such as Picasso”
  – Evidence for hypernym(artists, Picasso)
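A toy version of this pattern matching can be sketched in a few lines of Python. The regular expression and the splitting of the enumeration below are illustrative simplifications, not Probase's actual extraction pipeline, which mines evidence from billions of sentences:

```python
import re

# One Hearst pattern, "X such as Y1, Y2, ... and Yn", yielding candidate
# hypernym(X, Yi) pairs. Real extractors handle many more patterns and noise.
SUCH_AS = re.compile(r"(\w[\w ]*?)\s+such as\s+([\w ,]+)")

def hearst_pairs(text):
    pairs = []
    for concept, tail in SUCH_AS.findall(text):
        # Split the enumeration into individual entity mentions.
        for entity in re.split(r",\s*|\s+and\s+", tail):
            if entity.strip():
                pairs.append((concept.strip(), entity.strip()))
    return pairs

print(hearst_pairs("artists such as Picasso and Monet"))
# [('artists', 'Picasso'), ('artists', 'Monet')]
```

Aggregating such pairs over a Web-scale corpus is what lets Probase attach probabilities to each hypernym relation.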
Probase
• Very capable at conceptualizing groups of entities:
  – “China; India; United States” yields “country”
  – “China; India; Brazil; Russia” yields “emerging market”
• Differentiates attributes and entities
  – “birthday” -> “person” as attribute
  – “birthday” -> “occasion” as entity
• Applications
  – Clustering Tweets from Concepts [Song et al., 2011]
  – Understanding Web Tables
  – Query Expansion (Topic Search)
Research Questions
• What’s the best way of extracting concepts from text?
  – Compare techniques for semantic analysis
• How are extracted concepts useful?
  – Generate data about where semantic analysis techniques are applicable
• Are user ratings affected by the concepts in media items such as movies?
  – Test semantic analysis techniques in recommender systems
• How useful is Web-scale domain knowledge in narrower domains for information retrieval?
  – Identify the need for domain-specific knowledge
Semantic Analysis
• Generating meaning (concepts) from text
• Specifically, get prevalent hypernyms
  – E.g., “… Apple, IBM, and Microsoft …”
  – “technology companies”
• Semantic analysis using external knowledge
  – Probase Conceptualization
  – Explicit Semantic Analysis
  – WordNet Synsets
• Semantic analysis using latent features
  – Latent Dirichlet Allocation
  – Latent Semantic Analysis
Probase Conceptualization

[Pipeline diagram: for each document in the corpus, terms (t1, t2, t3, t4, …) found in the plain text are mapped through Probase to concepts (c1, c2, c3, c4, …) via Naïve Bayes or summation, then filtered by inverse document frequency to yield the document’s concepts.]
Probase Conceptualization
• “Cowboy doll Woody (Tom Hanks) is coordinating a reconnaissance mission to find out what presents his owner Andy is getting for his birthday party days before they move to a new house. Unfortunately for Woody, Andy receives a new spaceman toy, Buzz Lightyear (Tim Allen) who impresses the other toys and Andy, who starts to like Buzz more than Woody. Buzz thinks that he is an actual space ranger, not a toy, and thinks that Woody is interfering with his "mission" to return to his home planet…”
Text Source: Internet Movie Database (IMDb)
Sample Features for “Toy Story” (Probase)
• dvd encryptions (0.050): “RC”
• duty free item (0.044): “toys”
• generic word (0.043): “they, travel, it, …”
• satellite mission (0.032): “reconnaissance mission”
• creator-owned work (0.020): “Woody”
• amazing song (0.013): “fury”
• doubtful word (0.013): “overcome”
• ill-fated tool (0.013): “Buzz”
• lovable “toy story” character (0.011): “Buzz Lightyear, Woody, …”
• pleased star (0.010): “Woody”
• trail builder (0.010): “Woody”
Explicit Semantic Analysis
Image Source: Gabrilovich et al., 2007
Sample Features for “Toy Story” (ESA)
• #REDIRECT [[Buzz!]] 0.034
• #REDIRECT [[The Buzz]] 0.028
• #REDIRECT [[Buzz (comics)]] 0.027
• #REDIRECT [[Buzz cut]] 0.027
• #REDIRECT [[Buzz (DC Thomson)]] 0.024
• #REDIRECT [[Buzz Out Loud]] 0.024
• #REDIRECT [[The Daily Buzz]] 0.023
• #REDIRECT [[Buzz Aldrin]] 0.022
• #REDIRECT [[Buzz cut]] 0.022
• #REDIRECT [[Buzzing Tree Frog]] 0.022
Latent Dirichlet Allocation
• Blei et al., 2003
• Unsupervised learning method
• “Generates” documents from Dirichlet distributions over words and topics
• Topic distributions over documents can be inferred from the corpus
Image Source: Wikipedia
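The inference step can be illustrated with a tiny collapsed Gibbs sampler. This is a different estimator than the variational method of Blei et al., and the corpus, topic count, and hyperparameters below are purely illustrative, but it shows how per-document topic distributions are recovered from raw text:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of token lists.
    Returns one topic distribution per document."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K                                # topic totals
    z = []                                      # z[d][i]: topic of token i in doc d
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Full conditional: P(z=k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Smooth and normalize doc-topic counts into distributions.
    return [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
            for d, doc in enumerate(docs)]

docs = [["space", "nasa", "orbit"], ["hockey", "team", "game"],
        ["nasa", "launch", "orbit"], ["game", "team", "score"]]
theta = lda_gibbs(docs, K=2)
```

In the experiments, these per-document topic distributions are what serve as the LDA feature vectors.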
Recommendation Systems
• Collaborative Filtering
  – “Customers who purchased X also purchased Y.”
• Content-based
  – “Because you enjoyed ‘GoldenEye’, you may want to watch ‘Mission: Impossible’.”
• Hybrid
  – Most modern systems take a hybrid approach.
Content-based Recommendation
• In the GoldenEye/Mission: Impossible example…
  – Structured item content
    • Genre – Action/Adventure/Thriller
    • Tags – Action, Espionage, Adventure
  – Unstructured item content
    • Plot synopses – “helicopter, agent, infiltrate, CIA, …”
    • Concepts? – “aircraft, intelligence agency, …”
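The intuition that shared concepts make items similar can be sketched with cosine similarity over concept weight vectors. The movie vectors below are invented for illustration, not output from any of the analyzers:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse feature vectors (dicts)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical concept weights derived from the two plot synopses.
goldeneye = {"intelligence agency": 0.8, "aircraft": 0.5, "weapon": 0.3}
mission_impossible = {"intelligence agency": 0.7, "aircraft": 0.4, "disguise": 0.6}
toy_story = {"toy": 0.9, "birthday": 0.4}

print(cosine(goldeneye, mission_impossible) > cosine(goldeneye, toy_story))  # True
```

A content-based recommender would surface Mission: Impossible rather than Toy Story to a GoldenEye fan because the concept vectors overlap.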
Recommendation Systems
[Diagram: collaborative filtering approaches, structured content-based approaches, and unstructured content-based approaches; the semantic analysis techniques are tested in the unstructured content-based setting.]
Experiment
[Pipeline diagram: movie synopses from IMDb feed feature generation; the features, together with movie ratings from MovieLens, go into the Matchbox recommendation platform, which is evaluated by mean absolute error (MAE).]
Matchbox
Source: Matchbox API Documentation
Experimental Data
• Data: MovieLens Dataset [HetRec ’11]
  – 855,598 ratings
  – 10,197 movies
  – 2,113 users
• Movie synopses from IMDb (http://www.imdb.com)
  – Collected synopses for 2,633 movies
  – With 435,043 ratings
  – From 2,113 users
• Ratings data:
  – Scored by half points from 0.5 to 5
• Choose different numbers of movies (200; 1,000; all)
• Train on 90% of ratings, test on remaining 10%
Experimental Data
• Controls
  – Baseline 1: Only features are user IDs and movie IDs
  – Baseline 2: User IDs, movie IDs, movie genre
  – Baseline 3: User IDs, movie IDs, movie tags
• Feature Sets
  – Term Frequency – Inverse Document Frequency
  – Latent Dirichlet Allocation
  – Explicit Semantic Analysis
  – Probase Conceptualization
Experimental Setup
• 4 scenarios over the users × movies rating matrix (training: white, testing: black):
  1. New users and new movies in the test set
  2. New users only
  3. New movies only
  4. Users and movies both seen in training
Results
[Chart: MAE (y-axis, 0.56 to 0.595) across 10 runs for Baseline #1, Baseline #2, Baseline #3, TF-IDF Normalized, Probase Sum, and ESA.]
Results (MAE by # of movies)

Feature Set        All (2,633)   1,000      200
Baseline 1         0.672293      0.71654    0.802044
Baseline 2         0.641556      0.683297   0.752745
Baseline 3         0.655613      0.68994    0.764369
TF-IDF             0.674764      0.706914   0.815245
Probase            0.670694      0.715456   0.797196
ESA                0.670182      0.714967   0.796787
LDA (unfinished)   N/A           0.711307   0.790362
• Testing set contains users and movies not seen in the training set
• Recommendations based on item features alone
• Small amounts of structured data (e.g., genre) are the most influential in this scenario
Results (MAE by # of movies)

Feature Set        All (2,633)   1,000      200
Baseline 1         0.580087      0.564226   0.577349
Baseline 2         0.576183      0.563028   0.576673
Baseline 3         0.575398      0.563378   0.572297
TF-IDF             0.579906      0.575932   0.588288
Probase            0.578889      0.563669   0.578089
ESA                0.579798      0.564334   0.577638
LDA (unfinished)   N/A           0.566639   0.579633
• Testing set contains users not seen in the training set
• Lots of collaborative data available (explains comparable performance in all feature sets)
• Given extensive collaborative data, item features are marginally beneficial (in Matchbox)
Results (MAE by # of movies)

Feature Set        All (2,633)   1,000      200
Baseline 1         0.672843      0.687586   0.832491
Baseline 2         0.639683      0.651141   0.81416
Baseline 3         0.652071      0.66492    0.745593
TF-IDF             0.672362      0.665116   0.844305
Probase            0.670159      0.686235   0.823972
ESA                0.670451      0.683594   0.817306
LDA (unfinished)   N/A           0.684689   0.852056
• Testing set contains movies not seen in the training set
• Recommendations based on item features and extensive information on users’ “rating models”
• Small amounts of structured data (e.g., genre) are the most influential in this scenario (even for long-term users)
Results (MAE by # of movies)

Feature Set        All (2,633)   1,000      200
Baseline 1         0.560163      0.564673   0.568706
Baseline 2         0.556011      0.556456   0.567598
Baseline 3         0.550761      0.561643   0.56445
TF-IDF             0.551909      0.558942   0.588288
Probase            0.556414      0.558113   0.567332
ESA                0.556517      0.55706    0.568174
LDA (unfinished)   N/A           0.558105   0.568927
• Testing set contains users and movies seen in the training set
• Recommendations again are primarily collaborative
• Given a large corpus of rating data for users and items, item features are only marginally beneficial
Results (MAE for all 2,633 movies, by experiment)

Feature Set   New users & movies   New users   New movies   Known users & movies
Baseline 1    0.672293             0.580087    0.672843     0.560163
Baseline 2    0.641556             0.576183    0.639683     0.556011
Baseline 3    0.655613             0.575398    0.652071     0.550761
TF-IDF        0.674764             0.579906    0.672362     0.551909
Probase       0.670694             0.578889    0.670159     0.556414
ESA           0.670182             0.579798    0.670451     0.556517
Document Clustering
• Divide a corpus into a specified number of groups
• Useful for information retrieval
  – Automatically generated topics for search results
  – Recommendations for similar items/pages
  – Visualization of the search space
K-Means
1. Start with initial clusters
2. Compute the mean of each cluster
3. Compute the cosine distance of each item to the means
4. Assign items to clusters based on minimum distance
5. Repeat from step 2 until convergence
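The steps above can be sketched in plain Python. The toy 2-D data is for illustration only; the actual experiment ran over high-dimensional document feature vectors:

```python
import math, random

def cosine_dist(u, v):
    """1 - cosine similarity for dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

def kmeans(items, k, iters=50, seed=0):
    """K-means with cosine distance: random initial assignment, recompute
    means, reassign by minimum distance, repeat until convergence."""
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in items]
    for _ in range(iters):
        # Compute the mean of each cluster (random item if a cluster is empty).
        means = []
        for c in range(k):
            members = [x for x, a in zip(items, assign) if a == c]
            if not members:
                members = [rng.choice(items)]
            means.append([sum(col) / len(members) for col in zip(*members)])
        # Reassign each item to the nearest mean.
        new = [min(range(k), key=lambda c: cosine_dist(x, means[c])) for x in items]
        if new == assign:  # converged
            break
        assign = new
    return assign

# Two obvious clusters along different axes.
labels = kmeans([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 2)
print(labels)
```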
Experimental Setup
1. Generate features for datasets
2. Randomly assign initial clusters
3. Run K-Means
4. Compute purity and ARI
5. Repeat steps 2–4 20 times for mean and standard deviation
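Purity and the adjusted Rand index (ARI) in step 4 can be computed directly from cluster assignments and true class labels. This is a generic sketch of the two standard definitions, not the experiment's actual evaluation code:

```python
from collections import Counter
from math import comb

def purity(clusters, labels):
    """Fraction of items assigned to a cluster whose majority class matches."""
    total = len(labels)
    correct = 0
    for c in set(clusters):
        members = [labels[i] for i in range(total) if clusters[i] == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / total

def adjusted_rand_index(clusters, labels):
    """ARI from the contingency table of cluster vs. true-class counts."""
    n = len(labels)
    cont = Counter(zip(clusters, labels))
    a = Counter(clusters)   # cluster sizes
    b = Counter(labels)     # class sizes
    index = sum(comb(v, 2) for v in cont.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

clusters = [0, 0, 1, 1, 2, 2]
labels   = ["a", "a", "b", "b", "c", "c"]
print(purity(clusters, labels), adjusted_rand_index(clusters, labels))  # 1.0 1.0
```

Purity alone rewards degenerate solutions with many tiny clusters, which is why ARI (chance-corrected) is reported alongside it.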
Experimental Data
• 20 Newsgroups (mini)
• 2,000 messages from Usenet newsgroups
• 100 messages per topic
• Filter messages for body text
• Source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
From sci.electronics …
“A couple of years ago I put together a Tesla circuit which was published in an electronics magazine and could have been the circuit which is referred to here. This one used a flyback transformer from a tv onto which you wound your own primary windings...”
Results
Feature Set        Purity          ARI Scores
TF-IDF             0.379 ± 0.027   0.199 ± 0.023
Probase Only       0.265 ± 0.013   0.101 ± 0.010
Probase + TF-IDF   0.414 ± 0.034   0.241 ± 0.029
ESA Only           0.204 ± 0.010   0.040 ± 0.004
ESA + TF-IDF       0.389 ± 0.036   0.211 ± 0.032
LDA Only           N/A             N/A
LDA + TF-IDF       N/A             N/A
Results Comparison
• Song et al. Tweets Clustering
  – Experiment #2: Subtle cluster distinctions
  – Used Tweets about North America, Asia, Africa, and Europe
  – Comparable performance for ESA and Probase Conceptualization
• Hotho et al. WordNet Clustering
  – Used the Reuters dataset and Bisecting K-Means
  – Found best results for combined TF-IDF and feature sets
  – Overall improvement from WordNet features was comparable to Probase features (on the order of +10%)
Conclusions
• Semantic analysis features are marginally beneficial in recommendation
• Structured data from limited vocabularies works best for recommending “new items”
• Explicit and latent semantic analysis are comparable in recommendation
• Knowledge bases generated at Web-scale may be too noisy for narrow domain tasks
• Confirmed the efficacy of semantic analysis in document clustering tasks
Future Directions
• Noise Reduction
  – Tune the recommender platform for “concepts”
  – Further explore the parameter space for feature generators
  – Hybrid Conceptualization / Named Entity Disambiguation?
• Domain-specific knowledge sources
  – Comparison of Web-scale and domain-specific resources as external knowledge (e.g., [Aljaber et al., 2010])
Further Reading
• Short Text Conceptualization Using a Probabilistic Knowledge Base [Song et al., 2011]
• Exploiting Wikipedia as External Knowledge for Document Clustering [Hu et al., 2009]
• Hybrid Recommender Using WordNet “Bag of Synsets” [Degemmis et al., 2007]
• Hybrid Recommender Using LDA [Jin et al., 2005]
• Feature Generation for Text Categorization Using World Knowledge [Gabrilovich and Markovitch, 2005]
• WordNet Improves Text Document Clustering [Hotho et al., 2003]
Acknowledgements
• David Stern, Ulrich Paquet, Jurgen Van Gael
• Haixun Wang, Yangqiu Song, Zhongyuan Wang
• Special thanks to Evelyne Viegas!
• Microsoft Research Connections
References
• [Gabrilovich et al., 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI ’07). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1606–1611.
• [Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (March 2003), 993–1022.
• [Song et al., 2011] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In IJCAI 2011.
• [Stern et al., 2009] David H. Stern, Ralf Herbrich, and Thore Graepel. 2009. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web (WWW ’09). ACM, New York, NY, USA, 111–120.
• [HetRec ’11] Ivan Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems. ACM, New York, NY, USA.
• [Degemmis et al., 2007] Marco Degemmis, Pasquale Lops, and Giovanni Semeraro. 2007. A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Modeling and User-Adapted Interaction 17(3), 217–255.
References (cont.)
• [Jin et al., 2005] Xin Jin, Yanzan Zhou, and Bamshad Mobasher. 2005. A maximum entropy web recommendation system: combining collaborative and content features. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD ’05). ACM, New York, NY, USA, 612–617.
• [Hu et al., 2009] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. 2009. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’09). ACM, New York, NY, USA, 389–396.
• [Gabrilovich and Markovitch, 2005] Evgeniy Gabrilovich and Shaul Markovitch. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI ’05), 1606–1611.
• [Hotho et al., 2003] Andreas Hotho, Steffen Staab, and Gerd Stumme. 2003. WordNet improves text document clustering. In Proceedings of the SIGIR 2003 Semantic Web Workshop, 541–544.
• [Aljaber et al., 2010] Bader Aljaber, Nicola Stokes, James Bailey, and Jian Pei. 2010. Document clustering of scientific texts using citation contexts. Information Retrieval 13(2), 101–131.
Questions?
• Thanks for attending
Appendix
A. Matchbox Details
B. Implementation Details
C. Probase Conceptualization Details
D. Explicit Semantic Analysis Details
E. Learnings from Probase
(Appendix A) Matchbox
• [Stern et al., 2009]
• MSR Cambridge recommendation platform
• Implements a hybrid recommender using Infer.NET
  – Uses a combination of expectation propagation (EP) and variational message passing
• Reduces user, item, and context features to a low-dimensional trait space
(Appendix A) Matchbox Setup
• Matchbox settings
  – Uses 20 trait dimensions (determined experimentally)
  – 10 iterations of the EP algorithm
  – Trained on approx. 90% of ratings
  – Updated model with 75% of ratings per user (in the remaining 10%)
  – MAE computed for the remaining 25% per user
(Appendix B) Implementation
• ESA: https://github.com/faraday/wikiprep-esa
• LDA: Infer.NET
• Probase: Probase Package v. 0.18
• TF-IDF: http://www.codeproject.com/KB/cs/tfidf.aspx
• Matchbox: http://codebox/matchbox
(Appendix C) Probase Conceptualization
1. Identify all Probase terms in the text
2. Use the Noisy-or model to combine:
  – Concepts from t_l as attribute (z_l = 1)
  – Concepts from t_l as entity/concept (z_l = 0)
(Appendix C) Probase Conceptualization
3. Weight terms based on occurrence
  a. Naïve Bayes (similar to Song et al., 2010)
    • Compute P(c|t) for individual terms and use a Naïve Bayes model to derive concepts
    • Penalizes false positives, does not reward true positives
    • Generates very small probabilities for large numbers of terms
  b. Weighted Sum (similar to Gabrilovich et al., 2007)
    • Compute P(c|t) for individual terms and compute the sum over the document for each concept
    • Rewards true positives, does not penalize false positives (accurate concepts and inaccurate concepts, resp.)
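Variant (b) can be sketched in a few lines. The P(c|t) table below is invented for illustration; a real implementation would look these probabilities up in Probase:

```python
from collections import defaultdict

# Hypothetical P(concept | term) scores standing in for Probase lookups.
P_CONCEPT_GIVEN_TERM = {
    "apple":     {"company": 0.6, "fruit": 0.4},
    "ibm":       {"company": 0.9},
    "microsoft": {"company": 0.9},
}

def conceptualize_sum(terms):
    """Weighted-sum conceptualization: for each concept, sum P(c|t)
    over all recognized terms in the document."""
    scores = defaultdict(float)
    for t in terms:
        for concept, p in P_CONCEPT_GIVEN_TERM.get(t, {}).items():
            scores[concept] += p
    return dict(scores)

print(conceptualize_sum(["apple", "ibm", "microsoft"]))
```

For this input, "company" accumulates evidence from all three terms while "fruit" gets support from "apple" alone, illustrating why the sum rewards repeatedly supported concepts.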
(Appendix C) Probase Conceptualization
4. Penalize frequent concepts
  – Stop concepts (analogous to stop words) are domain-independent
  – For films, there are many domain-specific stop concepts
    • E.g., “movie”, “character”, “actor”, etc.
  – Inverse document frequency on concepts penalizes those that are too frequent
  – But it also rewards those that are too infrequent (appearing in only one document)
  – Solution: filter for minimum and maximum occurrence
(Appendix C) Probase Conceptualization
• Using Summation (similar to Wikipedia ESA)
• Using Naïve Bayes from the Song et al. approach:
  P(c|T) ∝ P(T|c) P(c) / P(T) ∝ (∏_{l=1..L} P(c|t_l)) / P(c)^(L-1)
• Inverse Document Frequency for concepts
  – IDF(c_k) = log ( # of documents / document frequency of c_k )
  – Minimum occurrence = 2
  – Maximum occurrence = 0.5 * # of documents
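The IDF weighting with the minimum/maximum occurrence filter can be sketched as follows, using the thresholds stated above; the concept vectors in the example are invented:

```python
import math

def filter_concepts(doc_concepts, min_df=2, max_df_ratio=0.5):
    """IDF-weight each document's concepts, dropping concepts that appear
    in fewer than min_df documents or in more than max_df_ratio of all
    documents (the stop-concept filter described above)."""
    n_docs = len(doc_concepts)
    df = {}
    for concepts in doc_concepts:
        for c in concepts:
            df[c] = df.get(c, 0) + 1
    max_df = max_df_ratio * n_docs
    out = []
    for concepts in doc_concepts:
        out.append({c: w * math.log(n_docs / df[c])
                    for c, w in concepts.items()
                    if min_df <= df[c] <= max_df})
    return out

docs = [{"movie": 1.0, "space ranger": 0.5, "toy": 0.8},
        {"movie": 1.0, "toy": 0.6},
        {"movie": 1.0, "spy": 0.9},
        {"movie": 1.0, "toy": 0.4, "spy": 0.2}]
filtered = filter_concepts(docs)
```

Here "movie" (in every document) and "space ranger" (in only one) are both removed, leaving only concepts with mid-range document frequency.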
(Appendix D) Explicit Semantic Analysis
• Gabrilovich et al., 2007
• Builds an inverted index of Wikipedia content
• Input text is converted to a weight vector of concepts based on TF-IDF
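A miniature version of this pipeline, with a three-article toy "Wikipedia" (invented titles and text) standing in for the real corpus of millions of articles:

```python
import math
from collections import Counter, defaultdict

# Toy corpus: article title -> text. In ESA the titles act as the explicit concepts.
articles = {
    "Space exploration": "rocket orbit astronaut rocket space",
    "Ice hockey": "puck goal skate team goal",
    "Astronomy": "telescope orbit star space",
}

def build_inverted_index(articles):
    """Map each word to a sparse vector of (concept, TF-IDF weight)."""
    n = len(articles)
    df = Counter(w for text in articles.values() for w in set(text.split()))
    index = defaultdict(dict)
    for title, text in articles.items():
        counts = Counter(text.split())
        for w, tf in counts.items():
            index[w][title] = tf * math.log(n / df[w])
    return index

def interpret(text, index):
    """Convert input text to a weighted concept vector by summing the
    concept vectors of its words."""
    vec = defaultdict(float)
    for w in text.split():
        for concept, weight in index.get(w, {}).items():
            vec[concept] += weight
    return dict(vec)

index = build_inverted_index(articles)
print(interpret("orbit space", index))
```

The input "orbit space" activates the space-related concepts and leaves "Ice hockey" at zero, which is exactly the behavior the movie-synopsis features rely on.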
(Appendix E) Learnings from Probase
• Conceptualization works wonders for small numbers of entities
• Would be extremely useful in a large-scale QA environment with many semantic analysis and ML algorithms (e.g., Watson)
• A noisy source of knowledge is best suited to noise-tolerant IR applications
• Still being developed and improving!
  – Working on recognizing verbs