
Beyond Search: Statistical Topic Models for Text Analysis

Keynote at SIGIR 2011, July 26, 2011, Beijing, China
Beyond Search: Statistical Topic Models for Text Analysis
ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
http://www.cs.uiuc.edu/homes/czhai

Search is a means to the end of finishing a task
Decision Making, Learning, Task Completion
Information Synthesis & Analysis
Search
Multiple searches, information synthesis, information interpretation; potentially iterate

Example Task 1: Comparing News Articles
Collections to compare: Vietnam War, Afghan War, Iraq War
Other partitions: CNN / Fox / BBC; before 9/11 / during Iraq war / post-Iraq war; US blog / European blog / Asian blog
Output: common themes (United Nations, death of people) alongside Vietnam-specific, Afghan-specific, and Iraq-specific content
What's in common? What's unique?

Example Task 2: Compare Customer Reviews
Collections to compare: IBM laptop reviews, APPLE laptop reviews, DELL laptop reviews
Output: common themes (battery life, hard disk, speed, ..) alongside IBM-specific, APPLE-specific, and DELL-specific content
Which laptop to buy?

Example Task 3: Identify Emerging Research Topics
What's hot in database research?

Example Task 4: Analysis of Topic Diffusion
One Week Later
How did a discussion of a topic in blogs spread?

Sample Task 5: Opinion Analysis on Blog Articles
Query = "Da Vinci Code"
What did people like/dislike about "Da Vinci Code"?
Sample blog sentences:
  "Tom Hanks, who is my favorite movie star act the leading role."
  "protesting... will lose your faith by watching the movie."
  "a good book to past time."
  "... so sick of people making such a big deal about a fiction book"

Questions
Can we model all these analysis problems in a general way? Yes!
Can we solve these problems with a unified approach? Yes!
How can we bring users into the loop?
Solutions: Statistical Topic Models

Rest of the talk
Overview of Statistical Topic Models
Contextual Probabilistic Latent Semantic Analysis (CPLSA)
Text Analysis Enabled by CPLSA
From Search Engines to Analysis Engines

What is a Statistical LM?
A probability distribution over word sequences
  p("Today is Wednesday") ≈ 0.001
  p("Today Wednesday is") ≈ 0.0000000000001
  p("The eigenvalue is positive") ≈ 0.00001
Context/topic dependent!
Can also be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model

The Simplest Language Model (Unigram Model)
Generate a piece of text by generating each word independently
Thus, p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn)
Parameters: {p(wi)}, with p(w1) + ... + p(wN) = 1 (N is the vocabulary size)
Essentially a multinomial distribution over words
A piece of text can be regarded as a sample drawn according to this word distribution
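
The independence assumption above can be illustrated with a minimal Python sketch. The three-word vocabulary and its probabilities are hypothetical values chosen for illustration, not figures from the talk. The sketch also shows that a unigram model ignores word order, in contrast to the general language model of the previous slide.

```python
import math

def unigram_log_prob(words, model):
    """Return log p(w1 ... wn) = sum_i log p(wi) under a unigram model."""
    return sum(math.log(model[w]) for w in words)

# Hypothetical toy model over a three-word vocabulary; probabilities sum to 1.
toy_model = {"today": 0.5, "is": 0.3, "wednesday": 0.2}

# The product p(w1) p(w2) p(w3) is the same for any ordering of the words.
p_in_order = math.exp(unigram_log_prob(["today", "is", "wednesday"], toy_model))
p_shuffled = math.exp(unigram_log_prob(["today", "wednesday", "is"], toy_model))
```

Working in log space avoids numerical underflow when the text is long.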

Text Generation with Unigram LM
(Unigram) language model θ with word distribution p(w|θ); sampling produces document d
Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, food 0.00001; generates a text mining paper
Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02; generates a food nutrition paper
Given θ, p(d|θ) varies according to d
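
A rough sketch of the sampling step, using the word probabilities listed on the slide. The slide shows only part of each distribution, so the sketch treats the listed values as relative weights, which is an assumption. The function name `sample_text` and the fixed seed are also illustrative.

```python
import random

def sample_text(model, length, seed=0):
    """Draw `length` words independently from a unigram model {word: weight}."""
    rng = random.Random(seed)
    words = list(model)
    weights = [model[w] for w in words]
    # random.choices treats weights as relative, so they need not sum to 1.
    return rng.choices(words, weights=weights, k=length)

# Partial distributions as listed on the slide.
topic1 = {"text": 0.2, "mining": 0.1, "association": 0.01,
          "clustering": 0.02, "food": 0.00001}
topic2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

text_mining_doc = sample_text(topic1, 20)
nutrition_doc = sample_text(topic2, 20)
```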

Estimation of Unigram LM
(Unigram) language model θ, p(w|θ) = ?
Document word counts (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1
Estimates: text 10/100, mining 5/100, association 3/100, database 3/100, query 1/100
Language model as topic representation?
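
The estimate on the slide is the relative frequency of each word, i.e. the maximum likelihood estimate. A minimal sketch follows. The helper name `mle_unigram` is illustrative, and the total of 100 is passed explicitly because the slide lists only some of the document's words.

```python
def mle_unigram(counts, total=None):
    """Maximum likelihood estimate p(w) = count(w) / total number of words."""
    if total is None:
        total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Word counts shown on the slide; the document has 100 words in total.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
model = mle_unigram(counts, total=100)
```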

Language Model as Text Representation: Early Work
1961: H. P. Luhn's early idea of using relative frequency to represent text [Luhn 61]
1976: Robertson & Sparck Jones' BIR model [Robertson & Sparck Jones 76]
1989: Wong & Yao's work on multinomial distribution representation [Wong & Yao 89]
H. P. Luhn (1961). The automatic derivation of information retrieval encodements from machine-readable texts. In A. Kent (Ed.), Information Retrieval and Machine Translation, Vol. 3, Pt. 2, pp. 1021-1028.
S. Robertson and K. Sparck Jones (1976). Relevance Weighting of Search Terms. JASIS, 27, 129-146.
S. K. M. Wong and Y. Y. Yao (1989). A probability distribution model for information retrieval. Information Processing and Management, 25(1):39-53.

Language Model as Text Representation: Two Important Milestones in 1998~1999
1998: Language model for retrieval (i.e., query likelihood scoring) [Ponte & Croft 98] (and also independently [Hiemstra & Kraaij 99])
1999: Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of ACM-SIGIR 1998, pages 275-281.
D. Hiemstra and W. Kraaij. Twenty-One at TREC-7: Ad-hoc and Cross-language track. In Proceedings of the Seventh Text REtrieval Conference (TREC-7), 1999.
Thomas Hofmann. Probabilistic Latent Semantic Analysis. UAI 1999: 289-296.

Probabilistic Latent Semantic Analysis (PLSA)
Topic 1 (Apple iPod): ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
Topic 2 (Harry Potter): movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02
Document: I downloaded the music of the movie harry potter to my ipod nano
Example word probabilities: ipod 0.15, harry 0.09

Parameter Estimation
Maximizing data likelihood:
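
The formula on this slide did not survive extraction. For reference, the standard PLSA log-likelihood is commonly written as below. The notation is an assumption and may differ from the slide's: C is the collection, V the vocabulary, c(w, d) the count of word w in document d, π_{d,j} the coverage of topic j in document d, θ_j the word distribution of topic j, and Λ the set of all parameters.

```latex
\log p(C \mid \Lambda) = \sum_{d \in C} \sum_{w \in V} c(w, d)\, \log \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)
```

The parameters are then chosen to maximize this quantity.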

Parameter Estimation using EM algorithm
Topic 1: ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
Topic 2: movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02
Document: I downloaded the music of the movie harry potter to my ipod nano
The topic affiliation of each word is unknown (?)
Iterate: guess the affiliation, then estimate the params
Pseudo-counts: prior set by users
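
The "guess the affiliation" and "estimate the params" loop can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the talk: the function name `plsa_em`, the random initialization, the tiny smoothing constant `eps` (a numerical guard, not the user-set prior from the slide), and the four toy documents are all hypothetical.

```python
import random

def plsa_em(docs, k, iters=50, seed=0, eps=1e-12):
    """Fit a k-topic PLSA model with EM.

    docs is a list of {word: count} dicts. Returns (pi, theta), where
    pi[d][j] is the coverage of topic j in document d and
    theta[j][w] is p(w | topic j).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Random initial word distributions; uniform initial topic coverage.
    theta = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        norm = sum(raw.values())
        theta.append({w: v / norm for w, v in raw.items()})
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        word_counts = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        new_pi = []
        for d, doc in enumerate(docs):
            topic_counts = [0.0] * k
            for w, c in doc.items():
                # E-step: guess the affiliation p(topic j | d, w).
                joint = [pi[d][j] * theta[j][w] for j in range(k)]
                norm = sum(joint)
                for j in range(k):
                    expected = c * joint[j] / norm
                    topic_counts[j] += expected
                    word_counts[j][w] += expected
            total = sum(topic_counts)
            new_pi.append([(t + eps) / (total + k * eps) for t in topic_counts])
        # M-step: re-estimate the parameters from the expected counts.
        pi = new_pi
        for j in range(k):
            norm = sum(word_counts[j].values()) + eps * len(vocab)
            theta[j] = {w: (v + eps) / norm for w, v in word_counts[j].items()}
    return pi, theta

# Hypothetical toy collection echoing the two topics on the slide.
docs = [
    {"ipod": 3, "nano": 2, "music": 1},
    {"ipod": 2, "apple": 1, "download": 1},
    {"harry": 3, "potter": 3, "movie": 1},
    {"movie": 2, "actress": 1, "harry": 1},
]
pi, theta = plsa_em(docs, k=2, iters=30)
```

EM only finds a local maximum of the likelihood, so different seeds can yield different topics.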

Context Features of a Document
Weblog article: author, author's occupation, location, time, communities, source

Compared with other kinds of data, Weblogs have some interesting special characteristics that make them worth exploiting for text mining.

A General View of Context
Partition of documents
  By time: 1998, 1999, ..., 2005, 2006 (e.g., papers written in 1998)
  By venue: WWW, SIGIR, ACL, KDD, SIGMOD
  By author (e.g., papers written by Bruce Croft)
Any combination of context features (metadata) can define a context

Empower PLSA with Context [Mei & Zhai 06]
Make topics depend on context variables
Text is generated from a contextualized PLSA model (CPLSA)
Fitting such a model to text enables a wide range of analysis tasks involving topics and context
Qiaozhu Mei, ChengXiang Zhai. A Mixture Model for Contextual Text Mining. Proceedings of the 2006 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), pages 649-655.

Contextual Probabilistic Latent Semantic Analysis
Document context: Time = July 2005, Location = Texas, Author = xxx, Occup. = Sociologist, Age Group = 45+
Themes:
  government: government 0.3, response 0.2, ..
  donation: donate 0.1, relief 0.05, help 0.02, ..
  New Orleans: city 0.2, new 0.1, orleans 0.05, ..
Views (View1, View2, View3): Texas, July 2005, sociologist
Theme coverages: Texas, July 2005, document
Generating each word: choose a view, choose a coverage, choose a theme, draw a word from θi
Example words drawn: government, response, donate, aid, help, new, Orleans
Example text:
  "Criticism of government response to the hurricane primarily consisted of criticism of its response to ..."
  "The total shut-in oil production from the Gulf of Mexico ... approximately 24% of the annual production and the shut-in gas production ..."
  "Over seventy countries pledged monetary donations or other assistance. ..."
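
The four-step generation process on this slide can be sketched as below. Only the theme word weights come from the slide. The view and coverage probabilities, the reuse of the same theme distributions for every view, and the names `pick` and `draw_word` are assumptions made to keep the example short. In CPLSA each view would carry its own version of every theme.

```python
import random

def pick(dist, rng):
    """Sample one key from a {key: weight} dict; weights need not sum to 1."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def draw_word(view_probs, coverage_probs, views, coverages, rng):
    """Choose a view, choose a coverage, choose a theme, then draw a word."""
    view = pick(view_probs, rng)
    coverage = pick(coverage_probs, rng)
    theme = pick(coverages[coverage], rng)
    return pick(views[view][theme], rng)

# Theme word weights as listed on the slide (partial distributions).
themes = {
    "government": {"government": 0.3, "response": 0.2},
    "donation": {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "New Orleans": {"city": 0.2, "new": 0.1, "orleans": 0.05},
}
# Assumption: every view shares the same theme distributions here.
views = {"Texas": themes, "July 2005": themes, "sociologist": themes}
view_probs = {"Texas": 0.5, "July 2005": 0.3, "sociologist": 0.2}  # hypothetical
coverages = {  # hypothetical theme coverages
    "Texas": {"government": 0.5, "donation": 0.3, "New Orleans": 0.2},
    "document": {"government": 0.2, "donation": 0.2, "New Orleans": 0.6},
}
coverage_probs = {"Texas": 0.4, "document": 0.6}  # hypothetical

rng = random.Random(0)
words = [draw_word(view_probs, coverage_probs, views, coverages, rng)
         for _ in range(10)]
```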

Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)
Clusters: Cluster 1, Cluster 2, Cluster 3
Common Theme: united 0.042, nations 0.04, killed 0.035, month 0.032, deaths 0.023
Iraq Theme: n 0.03, Weapons 0.024, Inspections 0.023, troops 0.016, hoon 0.015, sanches 0.012
Afghan Theme: Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011
The common theme indicates that the United Nations is involved in both wars
Collection-specific themes indicate different roles of the United Nations in the two wars

Spatiotemporal Patterns in Blog Articles
Query = "Hurricane Katrina"
Topics in the results:
Spatiotemporal patterns

Theme Life Cycles ("Hurricane Katrina")
New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177
Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182

The upper figure shows the life cycles of different themes in Texas. The red line refers to a theme whose top-probability words include price, oil, gas, and increase, from which we know that it is about oil prices. The blue line, on the other hand, covers events that happened in the city of New Orleans. In the upper figure, both themes were getting hot during the first two weeks and became weaker around mid-September. The theme New Orleans became strong again around the last week of September, while the other theme dropped monotonically.

The bottom figure shows the life cycles of the same theme, New Orleans, in different states. We observe that this theme reaches its highest probability first in Florida and Louisiana, followed by Washington and Texas, consecutively. During early September, this theme drops significantly in Louisiana while remaining strong in other states. We suppose this is because of the evacuation in Louisiana. Surprisingly, around late September, a re-arising pattern can be observed in most states, and it is most significant in Louisiana. Since this is the period in which Hurricane Rita arrived, we guess that Hurricane Rita had an impact on the discussion of Hurricane Katrina. This is reasonable, since people are likely to mention the two hurricanes together or make comparisons. We can find more clues for this hypothesis in the Hurricane Rita data set.

Theme Snapshots ("Hurricane Katrina")
Week 1: The theme is the strongest along the Gulf of Mexico
Week 2: The discussion moves towards the north and west
Week 3: The theme distributes more uniformly over the states
Week 4: The theme is again strong along the east coast and the Gulf of Mexico
Week 5: The theme fades out in most states

This slide shows snapshots of the theme "Government Response" over the first five weeks of Hurricane Katrina. The darker the color, the hotter the discussion about this theme. We observe that in the first week of Hurricane Katrina, the theme "Government Response" is the strongest in the southeastern states, especially those along the Gulf of Mexico. In week 2, the theme is spreading towards the northern and western states, as the northern states are getting darker. In week 3, the theme is distributed even more uniformly, which means that it is spreading all over the states. However, in week 4, the theme converges to the eastern states and the southeast coast again. Interestingly, this week happens to overlap with the first week of Hurricane Rita, which may have raised public concern about government response again in those areas. In week 5, the theme becomes weak in most inland states, and most of the remaining discussions are along the coasts.

Another interesting observation is that this theme is originally very strong in Louisiana (the state to the right of Texas), weakens dramatically there during weeks 2 and 3, and becomes strong again from the fourth week. Interestingly, weeks 2 and 3 are consistent with the time of the evacuation in Louisiana.

Theme Life Cycles (KDD Papers)
gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038
marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048
rules 0.0142, association 0.0064, support 0.0053

Theme Evolution Graph: KDD
Timeline: 1999, 2000, 2001, 2002, 2003, 2004
TSVM 0.007, criteria 0.007, classification 0.006, linear 0.005
decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005
Classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007
Information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004
web 0.009, classification 0.007, features 0.006, topic 0.005, mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005
topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005

Multi-Faceted Sentiment Summary (query = "Da Vinci Code")
Facet 1: Movie
  Neutral:
    "... Ron Howard's selection of Tom Hanks to play Robert Langdon."
    "Directed by: Ron Howard Writing credits: Akiva Goldsman ..."
    "After watching the movie I went online and some research on ..."
  Positive:
    "Tom Hanks stars in the movie, who can be mad at that?"
    "Tom Hanks, who is my favorite movie star act the leading role."
    "Anybody is interested in it?"
  Negative:
    "But the movie might get delayed, and even killed off if he loses."
    "protesting ... will lose your faith by ... watching the movie."
    "... so sick of people making such a big deal about a FICTION book and movie."
Facet 2: Book
  Neutral:
    "I remembered when i first read the book, I finished the book in two days."
    "I'm reading Da Vinci Code now."
  Positive:
    "Awesome book."
    "So still a good book to past time."
  Negative:
    "... so sick of people making such a big deal about a FICTION book and movie."
    "This controversy book cause lots conflict in west society."

Separate Theme Sentiment Dynamics
"book"
"religious beliefs"

Event Impact Analysis: IR Research
Theme: retrieval models (SIGIR papers)
term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173
Events by year:
  1992: Starting of the TREC conferences
  1998: Publication of the paper "A language modeling approach to information retrieval"
vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077
xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079
probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111
model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059

Many Other Variations
Latent Dirichlet Allocation (LDA) [Blei et al. 02]
  Impose priors on topic choices and word distributions
  Make PLSA a generative model
Many variants of LDA!
In practice, LDA and PLSA variants tend to work equally well for text analysis [Lu et al. 11]
[Blei et al. 02] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[Lu et al. 11] Yue Lu, Qiaozhu Mei, ChengXiang Zhai. Investigating Task Performance of Probabilistic Topic Models - An Empirical Study of PLSA and LDA. Information Retrieval, vol. 14, no. 2, April 2011.

Other Uses of Topic Models for Text Analysis
Topic analysis on social networks [Mei et al. 08]
Opinion Integration [Lu & Zhai 08]
Latent Aspect Rating Analysis [Wang et al. 10]
Qiaozhu Mei, Deng Cai, Duo Zhang, ChengXiang Zhai. Topic Modeling with Network Regularization. Proceedings of the World Wide Web Conference 2008 (WWW'08), pages 101-110.
Yue Lu, ChengXiang Zhai. Opinion Integration Through Semi-supervised Topic Modeling. Proceedings of the World Wide Web Conference 2008 (WWW'08), pages 121-130.
Hongning Wang, Yue Lu, ChengXiang Zhai. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'10), pages 115-124, 2010.

Topic Modeling + Social Networks: who works together on what?
Authors writing about the same topic form a community
Topic model only vs. topic model + social network
Separation of 3 research communities: IR, ML, Web

Topic Model for Opinion Integration
190,451 posts
4,773,658 results
How to digest all?

Why do we need to integrate opinions? Web 2.0 technology has enabled more and more people to freely express their opinions on the Web in various ways, such as customer reviews, internet forums, discussion groups, and Weblogs. The Web has become an extremely valuable source that covers a wide range of topics and a huge amount of opinions. For instance, given a topic such as the presidential candidate Hillary Clinton, there is a quite comprehensive article in Wikipedia, there are nearly 200,000 posts on Facebook, we get over 4 million results from blog search, and there are many other sources. With such a large-scale information source, how can a user digest all the opinions from different sources?

Two Kinds of Opinions
Expert opinions: CNET editor's review, Wikipedia article
  Well-structured
  Easy to access
  Maybe biased
  Outdated soon
Ordinary opinions (190,451 posts; 4,773,658 results): forum discussions, blog articles
  Represent the majority
  Up to date
  Hard to access
  Fragmental
How to benefit from both?

Generally, there are two kinds of opinion sources. Expert opinions are opinions expressed in a well-structured, relatively complete review, typically written by an expert, such as Wikipedia articles and CNET product reviews. Ordinary opinions are fragmental opinions scattered around in blog articles and forums. Both kinds have advantages and disadvantages. Expert opinions are very useful because they are well written and easy to access through the service websites that host them. However, they may be biased and can become outdated soon. In contrast, ordinary opinions tend to represent the general opinions of a large number of people and get refreshed quickly, but they are fragmental and hard to digest. How can we benefit from both kinds of opinions?

Generate an Integrative Summary
Input
  Topic: iPod
  Expert review with aspects: Design, Battery, Price, ..
  Text collection of ordinary opinions, e.g. Weblogs
Output: integrated summary
  Review aspects, each with similar opinions and supplementary opinions
    Design: cute, tiny, .. thicker ..
    Battery: last many hrs, die out soon
    Price: could afford it, still expensive
  Extra aspects
    iTunes: easy to use
    warranty: better to extend ..

We take as input a topic, which could be a product, a person, or an event; an expert review, which contains expert opinions on different aspects of the topic; and a text collection of ordinary opinions, which could be weblog or forum data. In the output, we hope to retain the well-written expert review, and we also extract useful information from the collection of ordinary opinions. Some of the ordinary opinions are aligned with the aspects in the expert review; we also want to discover opinions on extra aspects. We further separate ordinary opinions that are similar to expert opinions from those that are supplementary. We essentially use the expert review as a "template" to mine text data for ordinary opinions and generate an integrated summary that helps the user digest the opinions. Note that we take a broad definition of opinion in this paper. We do not consider subjectivity or polarity. In principle, any sentiment analysis technique could be applied after such a summary is generated.

Methods
Semi-Supervised Probabilistic Latent Semantic Analysis (PLSA)
  The aspects extracted from expert reviews serve as clues to define a conjugate prior on topics
  Maximum a Posteriori (MAP) estimation
  Repeated applications of PLSA to integrate and align opinions in blog articles to expert review
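
With a conjugate (Dirichlet) prior, MAP estimation of a topic's word distribution amounts to adding pseudo-counts to the expected counts before normalizing. A minimal sketch of that update follows. The function name `map_word_dist`, the prior strength `mu`, and all numbers are hypothetical. The prior is assumed to be a word distribution built from one aspect of the expert review.

```python
import math

def map_word_dist(expected_counts, prior, mu):
    """MAP estimate: p(w) = (count(w) + mu * prior(w)) / (total count + mu).

    `prior` must sum to 1; `mu` is the total pseudo-count, i.e. how strongly
    the prior pulls the estimate. With mu = 0 this is the ML estimate.
    """
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + mu
    return {w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
            for w in vocab}

# Hypothetical prior from a "battery" aspect of an expert review.
prior = {"battery": 0.5, "life": 0.5}
# Hypothetical expected word counts for this topic from blog articles.
expected_counts = {"battery": 6.0, "hours": 3.0, "playback": 1.0}

battery_topic = map_word_dist(expected_counts, prior, mu=10.0)
```

A larger `mu` keeps the topic closer to the expert-review aspect, while `mu = 0` recovers the unsupervised estimate.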

Results: Product (iPhone)
Opinion integration with review aspects
Aspect: Activation
  Review article: "You can make emergency calls, but you can't use any other functions"
  Similar opinions: N/A
  Supplementary opinions (unlock/hack iPhone): "... methods for unlocking the iPhone have emerged on the Internet in the past few weeks, although they involve tinkering with the iPhone hardware"
Aspect: Battery
  Review article: "rated battery life of 8 hours talk time, 24 hours of music playback, 7 hours of video playback, and 6 hours on Internet use."
  Similar opinions (confirm the opinions from the review): "iPhone will Feature Up to 8 Hours of Talk Time, 6 Hours of Internet Use, 7 Hours of Video Playback or 24 Hours of Audio Playback"
  Supplementary opinions (additional info under real usage): "Playing relatively high bitrate VGA H.264 videos, our iPhone lasted almost exactly 9 freaking hours of continuous playback with cell and WiFi on (but Bluetooth off)."

First, we show the opinion integration with review aspects. The three columns are the review, similar opinions, and supplementary opinions. The rows are different aspects. The first aspect shown is about activation of the iPhone; it describes what happens if you do not activate it. Our method discovered some supplementary opinions about unlocking or hacking the iPhone. This kind of information is not mentioned at all in the expert review but would be interesting to know. The next aspect is about battery life. We found similar opinions saying the iPhone could support xxx, which confirms the opinions from the review. In addition, we also extracted opinions on battery life under real usage. This is quite useful for a potential buyer.

Results: Product (iPhone)
Opinions on extra aspects
Support 15 (another way to activate iPhone): "You may have heard of iASign ... an iPhone Dev Wiki tool that allows you to activate your phone without going through the iTunes rigamarole."
Support 13 (iPhone trademark originally owned by Cisco): "Cisco has owned the trademark on the name "iPhone" since 2000, when it acquired InfoGear Technology Corp., which originally registered the name."
Support 13 (a better choice for smart phones?): "With the imminent availability of Apple's uber cool iPhone, a look at 10 things current smartphones like the Nokia N95 have been able to do for a while and that the iPhone can't currently match..."

We also display a few of the most supported opinions on extra aspects. The first points out another way to activate the iPhone; the second provides related information, namely that the iPhone trademark was originally owned by Cisco; the third recommends the Nokia N95, saying it is a better choice of smartphone. These opinions on extra aspects are useful for potential buyers, current customers, or just people who want to know more about the hot product.

Results: Product (iPhone)
Support statistics for review aspects
People care about price
People comment a lot about the unique wi-fi feature
Controversy: activation requires contract with AT&T

If we sum up the support of representative opinions, we can plot the support statistics for review aspects. The colored bars are the three most supported review aspects. Apparently people care a lot about the price. The unique wi-fi feature of the iPhone received many comments. There is also a lot of controversy over the requirement that activating the iPhone needs a two-year contract with AT&T. This kind of information could give the product company high-level feedback on its product and potentially support better business decisions.

Latent Aspect Rating Analysis
How to infer aspect ratings? (Value, Location, Service, ..)
How to infer aspect weights? (Value, Location, Service, ..)

Solution: Latent Rating Regression Model
Input: reviews + overall ratings
Aspect segmentation (topic model for aspect discovery) produces aspect segments with term counts:
  location:1, amazing:1, walk:1, anywhere:1
  nice:1, accommodating:1, smile:1, friendliness:1, attentiveness:1
  room:1, nicely:1, appointed:1, comfortable:1
Latent rating regression:
  Term weights: (0.1, 0.7, 0.1, 0.9), (0.0, 0.9, 0.1, 0.3), (0.6, 0.8, 0.7, 0.8, 0.9)
  Aspect ratings: 1.3, 1.8, 3.8
  Aspect weights: 0.2, 0.2, 0.6
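
The arithmetic on this slide can be checked with a short sketch: each aspect rating is the sum of term counts times term weights, and the overall rating is the aspect ratings combined by the aspect weights. Which weight vector belongs to which segment is not recoverable from the extracted text, so the pairing below is an assumption, and the function names are illustrative. This covers only the weighted sums shown on the slide, not the full probabilistic model.

```python
import math

def aspect_rating(term_counts, term_weights):
    """Aspect rating as the count-weighted sum of term weights."""
    return sum(c * w for c, w in zip(term_counts, term_weights))

def overall_rating(aspect_ratings, aspect_weights):
    """Overall rating as the weighted sum of aspect ratings."""
    return sum(r * a for r, a in zip(aspect_ratings, aspect_weights))

# Term counts are all 1 on the slide; the segment-to-weights pairing is assumed.
segments = [
    ([1, 1, 1, 1], [0.0, 0.9, 0.1, 0.3]),
    ([1, 1, 1, 1], [0.1, 0.7, 0.1, 0.9]),
    ([1, 1, 1, 1, 1], [0.6, 0.8, 0.7, 0.8, 0.9]),
]
ratings = [aspect_rating(counts, weights) for counts, weights in segments]
overall = overall_rating(ratings, [0.2, 0.2, 0.6])
```

The three sums reproduce the aspect ratings 1.3, 1.8, and 3.8 listed on the slide.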

Aspect-Based Opinion Summarization

Reviewer Behavior Analysis & Personalized Ranking of Entities
People like cheap hotels because of good value
People like expensive hotels because of good service
Query: 0.9 value, 0.1 others
Non-personalized vs. personalized

How can we extend a search engine to leverage topic models for text analysis?

How should we extend a search engine to support text analysis in general?

Analysis Engine based on Topic Models
Search + analysis interface
Query, search engine, results
Topic models
Workspace
Information synthesis, comparison, summarization, categorization

Beyond Search: Toward a General Analysis Engine
Decision Making, Learning, Task Completion
Information Synthesis & Analysis
Search
Analysis Engine

Challenges in Building a General Analysis Engine
What is a "task" and how can we formally model a task? (task vs. intent vs. information needs)
How to design a task specification language?
How do we design a set of general analysis operators to accommodate many different tasks?
What does ranking mean in an analysis engine (ranking terms, documents, topics, operators)?
What should the user interface look like?
How can we seamlessly integrate search and analysis?
How should we evaluate an analysis engine?

Analysis Operators
Select, Split, Intersect, Union, Topic, Interpret, Compare (Common, C1, C2), Ranking

Examples of Specific Operators
C = {D1, ..., Dn}; S, S1, S2, ..., Sk subset of C
Select operator
  Querying(Q): C → S
  Browsing: C → S
Split
  Categorization (supervised): C → S1, S2, ..., Sk
  Clustering (unsupervised): C → S1, S2, ..., Sk
Interpret: C x S → θ
Ranking: θ x Si → ordered Si

Compound Analysis Operator: Comparison of K Topics
Select Topic 1, ..., Select Topic k; Compare (Common, S1, S2); Interpret
Interpret(Compare(Select(T1,C), Select(T2,C), ..., Select(Tk,C)), C)
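
To make the operator composition concrete, here is a toy Python sketch. It is an illustration under stated assumptions, not the design proposed in the talk: the functions `select`, `compare`, and `interpret`, their keyword-matching and word-set semantics, and the four-document collection are all hypothetical stand-ins for the topic-model-based operators.

```python
from collections import Counter

def select(query, collection):
    """Select: C -> S. Keep documents that contain every query term."""
    terms = query.lower().split()
    return [d for d in collection if all(t in d.lower().split() for t in terms)]

def compare(*subsets):
    """Compare: return (words common to all subsets, words specific to each)."""
    vocabularies = [{w for d in s for w in d.lower().split()} for s in subsets]
    common = set.intersection(*vocabularies)
    return common, [v - common for v in vocabularies]

def interpret(words, collection):
    """Interpret: order words by frequency in the collection, then by name."""
    counts = Counter(w for d in collection for w in d.lower().split())
    return sorted(words, key=lambda w: (-counts[w], w))

# Hypothetical collection C.
C = [
    "iraq war united nations",
    "afghan war united nations aid",
    "iraq weapons inspections",
    "kabul aid",
]

# Interpret(Compare(Select(T1, C), Select(T2, C)), C) with k = 2.
common, specific = compare(select("iraq", C), select("afghan", C))
common_summary = interpret(common, C)
```

Because each operator maps document sets to document sets or word sets, the operators can be nested in the same way as the expression on the slide.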

Compound Analysis Operator: Split and Compare
Split; Compare (Common, S1, S2); Interpret
Interpret(Compare(Split(S,k)),C)

BeeSpace System: a biological analysis engine

Filter, Cluster, Summarize, Analyze

Intersection, Difference, Union,

Persistent Workspace

Sarma, M.S., et al. (2011) BeeSpace Navigator: exploratory analysis of gene function using semantic indexing of biological literature. Nucleic Acids Research, 2011, 1-8, doi:10.1093/nar/gkr285.

Automation-Confidence (AC) Tradeoff
Axes: automation of task vs. confidence in service
Return raw search results
Deliver actionable knowledge
Multi-resolution information delivery
Goal

Automation-Generality (AG) Tradeoff
Axes: automation of task vs. scalability/generality
Search engine
Complete support for special tasks
Operator-based analysis engine
Goal

Automation-Confidence Tradeoff: Dining Analogy
Serve raw food: needs further processing, but flexible for making different dishes
Serve cooked dishes: directly useful for a task, but would be worse if it's not the right dish
?

Automation-Generality Tradeoff: Dining Analogy
Buffet paradigm: basic components + infinite combination
Food court paradigm: finite choices of complete packages
What's the right paradigm? Need both paradigms?

Summary
Statistical topic models are promising general tools for supporting text analysis
Next-generation search engines should go beyond search to seamlessly support text analysis and better help users complete their tasks
Many challenges to be solved:
  Task modeling
  Task specification language
  New analysis operators
  New ranking models
  New interface issues
  New evaluation challenges
  Automation-Generality (AG) tradeoff & Automation-Confidence (AC) tradeoff

Looking Ahead
Text Analysis/Mining
Information Retrieval
Databases & Data Mining
Visualization
Natural Language Processing

Acknowledgments
Collaborators: Qiaozhu Mei, Yue Lu, Hongning Wang, Jiawei Han, Bruce Schatz, and many others
Funding

Thank You!
Questions/Comments?

Embedded chart data (numeric series omitted). Axes: Time (year), 1999 to 2004, vs. Normalized Strength of Theme. Series: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business.

