Co-occurrence and MeaningCo-occurrence graphs
Interpretation of Co-citationsTopical Anchors
References
Semantics hidden within co-occurrence patterns
A bottom-up approach to the Semantic Web?
Srinath Srinivasa
IIIT-Bangalore
IEEE Computer Society talk. Nov 20 2009. © Srinath Srinivasa, IIIT-Bangalore
Outline
1 Co-occurrence and Meaning
2 Co-occurrence graphs
3 Interpretation of Co-citations
4 Topical Anchors
Conventional WebIR and co-occurrence
Lexical feature extraction: Bag-of-words model
Document vectorization
Implicit assumption of independence of dimensions
Vector space reduction and spectral analyses for identifying hidden semantics (e.g., LSA, SVD, clustering)
In human languages, lexical terms are not independent of one another; important semantic structures are inherent in the way terms co-occur.
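As a minimal sketch of the bag-of-words model and its independence assumption (the toy documents and helper names are illustrative assumptions, not from the talk): each distinct term becomes its own dimension, so "car" and "automobile" contribute nothing to each other's similarity.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Map a document to term counts; each distinct term is one dimension."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc_a = bag_of_words("my car runs on diesel")
doc_b = bag_of_words("my automobile runs on diesel")

# The documents are similar only through the terms they literally share.
print(cosine(doc_a, doc_b))
# Synonyms sit on orthogonal axes, so their similarity is zero.
print(cosine(bag_of_words("car"), bag_of_words("automobile")))
```

Spectral methods such as LSA try to recover the lost relationship afterwards; the rest of the talk instead models co-occurrence directly.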
Motivational Problems
Some motivational problems to show limitations of purely lexical approaches to IR:
The topical anchor problem
“If ever a player has overshadowed Sachin Tendulkar for sheer class of batsmanship, it is V V S Laxman. After a record 353-run fourth-wicket partnership in the 2004 Sydney Test when Laxman hit 30 fours in his 178 to Tendulkar’s 33 in his unbeaten 241, the master put the artistry of V V S in perspective.”
What is the best topic of this paragraph: Sachin Tendulkar, V V S Laxman, Sydney, Australia, Cricket, Test Match
Motivational Problems
The semantic attributes problem
Given that a user has searched for the term “Malmo”, which of the following keywords can be termed “attributes” that enhance the meaning represented by Malmo:
Driving
History
Mileage
Weather
Symptoms
Elephant
LaTeX Beamer
Infringement
Motivational Problems
The topical marker problem
The US Federal Aviation Regulations Sec 380.12 states that:
The charter operator may not cancel a charter for any reason (including insufficient participation), except for circumstances that make it physically impossible to perform the charter trip, less than 10 days before the scheduled date of departure of the outbound trip.
If the charter operator cancels 10 or more days before the scheduled date of departure, the operator must so notify each participant in writing within 7 days after the cancellation but in any event not less than 10 days before the scheduled departure date of the outbound trip. If a charter is canceled less than 10 days before scheduled departure (i.e., for circumstances that make it physically impossible to perform the charter trip), the operator must get the message to each participant as soon as possible.
If a user who has booked a ticket with a charter operator finds out that her flight has been cancelled suddenly without notice and wants to confront the operator, what should she search for: charter operator, FAR, cancellation, scheduled trip, Sec 380, operator, notification, ...
Motivational Problems
The theme problem:
Article 1
A US Airways Airbus A320 suffered a bird hit immediately after take off from New York’s La Guardia airport and was forced to land on the Hudson river. All 150 passengers and 5 crew members are reported to be safe with only minor injuries.
Article 2
La Guardia International Airport is conveniently located to serve the citizens of both New York and New Jersey. Airlines that operate from here include: US Airways, Delta, Continental and Virgin. Its MRO routinely serves a number of aircraft types including Boeing 73x series and the Airbus A320 and A330 series. La Guardia is easily reachable from New Jersey through the Lincoln tunnel that runs under the Hudson river.
Article 3
Pilot Steve Bolle of a light aircraft in northern Australia was forced to make an emergency landing in water after suffering engine trouble on take-off. He landed the Piper Chieftain plane in shallow waters after realising he would not make it back to the airport. Mr Bolle and his five passengers were able to wade safely to shore.
Which of the articles above are similar to one another? (Ack to sources: Wikipedia and BBC)
Co-occurrence and Meaning
Hebbian learning
Co-occurrence plays a central role in the Hebbian theory of the semantic organization of the human mind, which states that synaptic plasticity between neurons is determined by repeated and persistent stimulation of the pre- and post-synaptic cells [2].
This is also summarized as: Cells that fire together, wire together
Co-occurrence and the language instinct
Language structures such as pluralization are often learnt by analyzing co-occurrence patterns. An interesting example is the “wug” test (cf. [5]):
That is a pig; these are pigs. That is a dog; these are dogs. That is a cat; these are cats. That is a wug; these are ___.
The use of co-occurrence is even more apparent in this example, which leads to confusion (even if only for a moment):
The plural of radius is radii; the plural of thesis is theses; the plural of bus is buses. The plural of lotus is lotii? lotes? lotuses?
Co-occurrence and meaning
Meaning is usage
The analytic philosophy worldview: Meaning is usage [1] can be explained by representing usage as co-occurrence analysis.
Consider the following paragraphs:
Every day, I go to work in my pqer. My pqer runs on diesel and gives one of the best mileage for pqers in its category. My pqer can seat five people and is a good candidate for pqer-pooling.
On December 26 2004, a massive earthquake measuring 9.1 jolted Java. This earthquake triggered a huge tsunami that has been the deadliest in history. We have developed an applet to simulate the path taken by the tsunami. You can run this applet in any browser that has Java enabled.
In the first paragraph, the meaning of the word “pqer” and in the second paragraph, the word-sense of the term “Java” are both resolved by looking at other terms that co-occur with them.
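The word-sense point can be illustrated with a small sketch. The sense signatures below are hand-written, hypothetical term sets (an assumption for illustration; the talk does not prescribe them), and the sense whose signature overlaps most with the co-occurring terms of a sentence is chosen.

```python
import re

# Hypothetical sense signatures: terms assumed to co-occur with each sense.
SENSES = {
    "island": {"earthquake", "tsunami", "jolted"},
    "programming language": {"applet", "browser", "enabled", "run"},
}

def tokens(sentence):
    """Lower-cased alphabetic tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def guess_sense(sentence, senses=SENSES):
    """Pick the sense whose signature overlaps most with the sentence's terms."""
    terms = tokens(sentence)
    return max(senses, key=lambda s: len(senses[s] & terms))

s1 = "On December 26 2004, a massive earthquake measuring 9.1 jolted Java."
s2 = "You can run this applet in any browser that has Java enabled."

print(guess_sense(s1))
print(guess_sense(s2))
```

In practice the signatures would be derived from corpus co-occurrence statistics rather than written by hand.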
Capturing co-occurrence
We are given a document corpus that is represented as a set of “contexts”: $C = \{C_1, C_2, \ldots, C_n\}$
Depending on the specific problem, a context may take various forms, such as a sentence, a paragraph or a document.
Two entities $e_i$ and $e_j$ are said to co-occur (denoted as $e_i \bowtie e_j$) if there is some context $C_k \in C$ such that $e_i, e_j \in C_k$
The support for a co-occurring pair $e_i \bowtie e_j$ is the probability of finding this co-occurrence in any given context in the corpus. In other words, the support is the joint probability $P(e_i, e_j)$
Note that co-occurrence is an n-ary relation. But for purposes of simplicity, we focus on pairwise co-occurrences and derive higher order semantics when required.
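The definition of support can be sketched directly: count the contexts in which a pair appears together and divide by the number of contexts. The function name and the toy contexts are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_support(contexts):
    """Return {frozenset({ei, ej}): P(ei, ej)} over a list of contexts.

    Each context is a collection of entities; a pair is counted at most
    once per context.
    """
    counts = Counter()
    for context in contexts:
        for pair in combinations(sorted(set(context)), 2):
            counts[frozenset(pair)] += 1
    n = len(contexts)
    return {pair: c / n for pair, c in counts.items()}

contexts = [
    {"tendulkar", "laxman", "cricket"},
    {"tendulkar", "cricket"},
    {"federer", "tennis"},
    {"laxman", "sydney"},
]
support = pair_support(contexts)
print(support[frozenset({"tendulkar", "cricket"})])  # 2 of 4 contexts
```

Pairs that never share a context are simply absent from the result, which corresponds to a support of zero.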
Co-occurrence graphs
Co-occurrence graph
A co-occurrence graph is a weighted, undirected graph $G = (E, \bowtie, w)$, where $E$ is a set of “entities”, $\bowtie \subseteq E \times E$ is a set of co-occurrences, and $w : \bowtie \to \mathbb{R}$ indicates support for the co-occurrence
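A minimal sketch of such a graph, assuming a symmetric adjacency map as the representation (the function name and sample contexts are hypothetical): each undirected edge is stored in both directions with the pair's support as its weight.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(contexts):
    """Build a weighted, undirected co-occurrence graph.

    graph[ei][ej] == graph[ej][ei] == support of the pair (ei, ej).
    Entities that never co-occur with anything do not appear as nodes.
    """
    graph = defaultdict(dict)
    n = len(contexts)
    for context in contexts:
        for ei, ej in combinations(sorted(set(context)), 2):
            weight = graph[ei].get(ej, 0.0) + 1.0 / n
            graph[ei][ej] = weight
            graph[ej][ei] = weight
    return dict(graph)

contexts = [
    {"java", "applet", "browser"},
    {"java", "tsunami"},
    {"java", "applet"},
]
graph = build_cooccurrence_graph(contexts)
print(graph["java"])
```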
Co-occurrence versus n-partite graphs
Semantic co-occurrence graphs
A semantic co-occurrence graph is a co-occurrence graph that is augmented with a concept hierarchy. A concept hierarchy is defined by one or more partial orders of the form $\sqsubseteq \subseteq E \times E$, representing relationships like is-a and is-in, that are reflexive, anti-symmetric and transitive.
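The three properties required of a concept hierarchy can be checked mechanically. The sketch below assumes the relation is given as a set of (a, b) pairs meaning "a is-a b"; the entity names are illustrative.

```python
def is_partial_order(elements, relation):
    """Check that `relation` (a set of (a, b) pairs) is reflexive,
    anti-symmetric and transitive over `elements`."""
    reflexive = all((e, e) in relation for e in elements)
    antisymmetric = all(a == b or (b, a) not in relation for a, b in relation)
    transitive = all(
        (a, d) in relation
        for a, b in relation
        for c, d in relation
        if b == c
    )
    return reflexive and antisymmetric and transitive

entities = {"sedan", "car", "vehicle"}
is_a = {(e, e) for e in entities} | {
    ("sedan", "car"),
    ("car", "vehicle"),
    ("sedan", "vehicle"),
}
print(is_partial_order(entities, is_a))
# Dropping the implied pair breaks transitivity.
print(is_partial_order(entities, is_a - {("sedan", "vehicle")}))
```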
Co-occurrence graph
Example:
Concept hierarchy construction
1 Start with a base ontology
2 Use co-occurrence patterns to guess conceptual relationships across terms
3 Use concept hierarchy to identify deeper co-occurrence patterns
4 Repeat from step 2 in a semi-automated fashion until the algorithm stabilizes
Co-occurrence graphs
Characteristics of co-occurrence graphs
Triadic closure (highly clustered)
Disconnected components or a single component of very small diameter
The co-occurrence graph of all noun phrases in Wikipedia has a diameter of 4
Co-occurrence support for entity pairs follows a power law
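Two of these characteristics, clustering and diameter, can be measured on any unweighted adjacency map. The following is a sketch on a hypothetical four-node graph, using the standard local clustering coefficient and breadth-first search; it assumes the graph is connected and symmetric.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = list(graph[node])
    if len(neighbours) < 2:
        return 0.0
    pairs = list(combinations(neighbours, 2))
    closed = sum(1 for a, b in pairs if b in graph[a])
    return closed / len(pairs)

def eccentricity(graph, source):
    """Longest shortest-path distance (in hops) from source, via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(graph):
    """Diameter of a connected graph: the largest eccentricity."""
    return max(eccentricity(graph, n) for n in graph)

graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(clustering_coefficient(graph, "a"), diameter(graph))
```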
Co-citation
Co-citation and bibliographic coupling are important metrics in several datasets, such as scientific literature, web pages, wikis and tagging systems like delicious.
Co-citation of a pair of documents corresponds to the co-occurrence of references to these documents (e.g., URLs) in a context
Pair-wise co-citation graphs have the same properties as co-occurrence graphs
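Treating each citing page as a context, co-citation counts follow the same pattern as co-occurrence counts. A minimal sketch, assuming the link data is available as a map from citing page to its set of outlinks (page names are hypothetical):

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(outlinks):
    """Count, for every pair of targets, how many pages cite both.

    `outlinks` maps a citing page to the set of pages it links to.
    Pairs are stored as alphabetically sorted tuples.
    """
    counts = Counter()
    for targets in outlinks.values():
        for pair in combinations(sorted(targets), 2):
            counts[pair] += 1
    return counts

outlinks = {
    "p1": {"A", "B"},
    "p2": {"A", "B", "C"},
    "p3": {"B", "C"},
}
counts = cocitation_counts(outlinks)
print(counts[("A", "B")])
```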
Co-citation Patterns
Hyperlink distance across pairs of highly co-cited pages [8]
Figure: Hyperlink distance across pairs of highly co-cited Web pages (x-axis: k = 1 to 7, kmax, >kmax; y-axis: F, 0 to 300)
Figure: Hyperlink distance across pairs of highly co-cited Wikipedia pages (x-axis: k = 1 to 7, kmax, >kmax; y-axis: F, 0 to 12000)
Co-citation Patterns
Hyperlink distance across pairs of highly co-cited pages
Endorsement of a citation
Page A endorses the content of page B
Users reading page A traverse this link and find page B useful too
Users create their own pages citing both A and B
If A has several outgoing links, and only some pairs of outlinks are co-cited, then co-citation can be seen as an endorsement of the citation
Topical aggregation
Document A represents content about a “higher-level” topic in terms of is-a or is-in relationships, and links to (hence co-cites) several pages on “lower-level” topics
Pages on the “lower-level” topics usually cite back the page on the “higher-level” topic, hence giving a citation distance of 2 among themselves
Nepotistic co-citations
Another major source of co-citation (primarily on web pages) is “nepotistic links” in the form of navigational tabs like Home, Departments, Contact Us, etc.
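The two distances discussed above (1 for an endorsed citation, 2 for topical aggregation) can be reproduced with a breadth-first search over directed hyperlinks. The link map and page names below are hypothetical.

```python
from collections import deque

def hyperlink_distance(links, source, target):
    """Fewest hyperlinks to follow from source to target, or None."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        if page == target:
            return dist[page]
        for nxt in links.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return None

links = {
    "hub": {"topic1", "topic2"},  # higher-level page links to lower-level pages
    "topic1": {"hub"},            # lower-level pages cite back the hub
    "topic2": {"hub"},
    "A": {"B"},                   # A cites (endorses) B
}
print(hyperlink_distance(links, "A", "B"))
print(hyperlink_distance(links, "topic1", "topic2"))
```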
Co-citation graph of a web crawl
Pairs of pages with at least 100 non-nepotistic co-citations
Co-citation graph of a web crawl
The co-citation graph depicts pairs of pages with at least 100 non-nepotistic co-citations
In addition to being made of disconnected components, the graph also shows various recurring structural motifs like:
Star
Clique
Clique chain
Dumb-bell
Interpretations for the above motifs along with examples are explained in Mutalikdesai and Srinivasa (2009) [4]
Endorsed hyperlink graph (EHG)
On the web, co-citation usually implies a citation. Hence the EHG is essentially a directed version of the co-citation graph. Some EHG components are depicted below:
EHG clique chain
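One possible reading of the EHG, offered here as an assumption rather than the authors' exact construction: keep a hyperlink A → B only if A and B are also co-cited by at least `min_cocitations` other pages. The parameter name, function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def endorsed_hyperlink_graph(outlinks, min_cocitations=1):
    """Return the set of directed links (src, dst) whose endpoints are
    co-cited at least `min_cocitations` times (assumed EHG reading)."""
    cocited = Counter()
    for targets in outlinks.values():
        for pair in combinations(sorted(targets), 2):
            cocited[frozenset(pair)] += 1
    return {
        (src, dst)
        for src, targets in outlinks.items()
        for dst in targets
        if cocited[frozenset((src, dst))] >= min_cocitations
    }

outlinks = {
    "A": {"B", "C"},   # A links to B and C
    "u1": {"A", "B"},  # two user pages co-cite A and B
    "u2": {"A", "B"},
}
ehg = endorsed_hyperlink_graph(outlinks, min_cocitations=2)
print(ehg)
```

Here the link A → B survives because the pair is co-cited twice, while A → C is dropped because A and C are never co-cited.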
Endorsed citation graph (ECG) for scientific literature
ECG of citation info obtained from CiteSeer
Endorsed citation graph
The ECG over scientific literature data (using CiteSeer) shows similar componentization of the graph, except that the ECG has one giant component
Citation in scientific literature has some subtle differences from hyperlink citations
Scientific literature citations are always into the past
Very rarely (if at all) do scientific literature citations form cyclic structures
The ECG consists mostly of weakly connected directed graph components, while the EHG may contain strongly connected components
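The structural difference can be illustrated by testing a directed graph for cycles: citations that only point into the past cannot form one, while hyperlinks can. This sketch uses a standard depth-first search; the two small graphs are made-up examples.

```python
def has_cycle(graph):
    """Detect a directed cycle using a three-colour depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY  # node is on the current DFS path
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:  # back edge: a cycle exists
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK  # fully explored
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(graph))

# Citations point only to older papers, so no cycle can form.
citations = {
    "paper2009": {"paper2005", "paper2001"},
    "paper2005": {"paper2001"},
    "paper2001": set(),
}
# Web pages can link to each other freely.
hyperlinks = {"home": {"about"}, "about": {"home"}}

print(has_cycle(citations), has_cycle(hyperlinks))
```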
ERank
Importance of a page within an EHG
ERank is an authority score of a page within an EHG (ECG) component
Depicts reachability of the page within the component
ERank scores in a component have been shown to be uncorrelated with the PageRank scores of the pages of that component
EndorSeer
A Firefox plugin for augmented browsing of CiteSeer
Currently shows endorsed citations from among the list of citations from any paper
Currently underway: Show the ECG component and ECG neighbourhood of a paper
Topical Anchors [6, 7]
Motivation
Example: “Will my oral insulin drugs, along with my hypertension and high blood glucose, have any side effects on the health of my pancreas?”
Can a machine detect diabetes as the context?
Another example: A document containing the words Andy Roddick, Roger Federer and Rafael Nadal.
How likely is it that the word Tennis will be mentioned (semantically) when discussing these players?
Co-occurrence context
Given a set of query terms, the co-occurrence context is defined as the subgraph formed by the query terms and the set of terms that co-occur with at least one of the query terms
Conjecture: The topical anchor of a set of terms is a highly authoritative term that lies within the co-occurrence context of the query terms
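A minimal sketch of extracting a co-occurrence context. The adjacency-dict representation, the function name `cooccurrence_context`, and the toy weights are assumptions made for illustration; they are not taken from the talk.

```python
def cooccurrence_context(graph, query_terms):
    """Return the subgraph induced by the query terms and their neighbours.

    `graph` maps each term to a dict {neighbour: co-occurrence weight}
    (an assumed representation).
    """
    nodes = set()
    for q in query_terms:
        if q in graph:
            nodes.add(q)
            nodes.update(graph[q])  # every term co-occurring with q
    # Keep only edges whose two endpoints are both inside the context.
    return {
        u: {v: w for v, w in graph.get(u, {}).items() if v in nodes}
        for u in nodes
    }


# Toy, hand-made weights (illustrative only, not from any corpus).
toy_graph = {
    "Federer": {"Tennis": 5, "Wimbledon": 3},
    "Nadal": {"Tennis": 4, "Clay": 2},
    "Tennis": {"Federer": 5, "Nadal": 4, "Racket": 1},
    "Wimbledon": {"Federer": 3},
    "Clay": {"Nadal": 2},
    "Racket": {"Tennis": 1},
}
context = cooccurrence_context(toy_graph, ["Federer", "Nadal"])
```

In this toy graph, "Racket" co-occurs only with "Tennis", which is not a query term, so it falls outside the context.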
Online Page Importance Computation (OPIC)
Each node $i$ in the context is initialised with a cash $c_i$.
A node $a$ is picked at random and its cash $c_a$ is added to its history $h_a$.
Then $c_a$ is distributed amongst all its neighbours in proportion to the edge weights.
This process is iterated until the ratios of the $h_i$ become nearly constant.
The node with the largest $h_i$ is chosen as the most central node.
Unfortunately, OPIC was found to be unsuitable for determining topical anchors, since it tends to find nodes that are central to the entire graph
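The steps above can be sketched as follows. This is an illustration under stated assumptions: every node starts with one unit of cash, the graph is an adjacency dict in which every neighbour is also a key, and a fixed number of steps stands in for the convergence check on the history ratios. The function name `opic` and the star graph are hypothetical.

```python
import random


def opic(graph, steps=2000, seed=0):
    """Cash/history iteration on a weighted graph {node: {neighbour: weight}}."""
    rng = random.Random(seed)            # seeded so runs are repeatable
    nodes = sorted(graph)
    cash = {n: 1.0 for n in nodes}       # initial cash c_i (uniform: an assumption)
    history = {n: 0.0 for n in nodes}    # accumulated history h_i
    for _ in range(steps):               # fixed step count replaces the ratio test
        a = rng.choice(nodes)
        if not graph[a]:
            continue                     # a node without neighbours keeps its cash
        c = cash[a]
        history[a] += c                  # record the cash in the node's history
        cash[a] = 0.0
        total_w = sum(graph[a].values())
        for nbr, w in graph[a].items():
            cash[nbr] += c * w / total_w  # share proportional to edge weight
    return cash, history


# Toy star graph: "hub" is linked to three leaves with equal weights.
star = {
    "hub": {"a": 1, "b": 1, "c": 1},
    "a": {"hub": 1},
    "b": {"hub": 1},
    "c": {"hub": 1},
}
final_cash, final_history = opic(star)
most_central = max(final_history, key=final_history.get)
```

Cash is only moved, never created or destroyed, so the total stays at its initial value; the hub sees all the cash that passes through the leaves and ends up with the largest history.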
Cash Leaking Random Walk
Co-occurrence graphs have extremely small diameters (4-5).
Roger Federer to feral child in two hops.
Football becomes most central to Roger Federer and Rafael Nadal instead of Tennis.
Solution: Cash Leakage
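The slide names cash leakage as the fix but does not spell out the rule. One plausible reading, assumed here, is that a fixed fraction `leak` of the cash is dropped at every hop, so influence decays with distance from the query terms. The sketch below is a deterministic, synchronous propagation and not a random walk; the function name, the `leak` and `rounds` parameters, and the chain graph are all hypothetical.

```python
def leaky_propagation(graph, seeds, leak=0.5, rounds=3):
    """Spread cash from the seed nodes, losing a fraction `leak` at each hop.

    Assumes every neighbour in `graph` is also a key of `graph`.
    """
    cash = {n: 0.0 for n in graph}
    for s in seeds:
        cash[s] = 1.0                     # one unit of cash per query term
    history = {n: 0.0 for n in graph}
    for _ in range(rounds):
        nxt = {n: 0.0 for n in graph}
        for node, c in cash.items():
            history[node] += c            # record cash held at this node
            total_w = sum(graph[node].values())
            if c == 0.0 or total_w == 0:
                continue
            for nbr, w in graph[node].items():
                # Only the (1 - leak) share is forwarded; the rest leaks away.
                nxt[nbr] += (1.0 - leak) * c * w / total_w
        cash = nxt
    return cash, history


# Toy chain (illustrative): Federer - Tennis - Sport - Football.
chain = {
    "Federer": {"Tennis": 1},
    "Tennis": {"Federer": 1, "Sport": 1},
    "Sport": {"Tennis": 1, "Football": 1},
    "Football": {"Sport": 1},
}
leaky_cash, leaky_history = leaky_propagation(chain, ["Federer"])
```

Under this assumed rule, nodes several hops away from the query terms receive very little cash, which is the effect the slide asks for.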
Bias and History Vectors
The way centrality is computed introduces a hidden bias among the query words.
Example: Jim Carrey, Hugh Grant, Rajkumar
Bias due to difference in neighbourhood sizes
Bias due to polysemy
Example: Java, Beans, Kaffe
Bias examples
Query Terms | Topical Anchors
Java, Beans, Kaffe | Programming language, Indonesia, Food
United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau
Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number
MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay
Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree
Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson
Table: Examples with irrelevant topical anchors
Solution to the topic bias problem
Labelled cash.
Vector models of CLRW
Cash from each query term $q_i$ is given a “colour” $c_i$. The cash history at any node is hence a vector of the form $(v_1, v_2, \ldots, v_n)$ showing the cash flow history for each of the colours. The vector is then normalized as:
$v'_i = \frac{v_i}{v}$
where $v = \max_i v_i$ and $v'_i \in [0, 1]$
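A small sketch of this normalization, assuming the history vector is a plain list of non-negative floats with one entry per colour. The function name and the all-zero fallback are assumptions.

```python
def normalize_history(vec):
    """Divide every component by the largest one, so values lie in [0, 1]."""
    peak = max(vec)
    if peak == 0:
        # No cash of any colour reached this node (fallback is an assumption).
        return [0.0 for _ in vec]
    return [v / peak for v in vec]


normalized = normalize_history([2.0, 4.0, 1.0])
```

Here the second colour dominates, so it maps to 1 and the others to 0.5 and 0.25.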
Projection
The line joining $\vec{0}_n$ to $\vec{1}_n$ represents points where all query terms have contributed equally to the cash history. This is called the baseline
Hence, for any given node, its projection onto the baseline represents the importance of the node as a topical anchor
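A sketch of the projection score, taken here as the scalar projection of a history vector onto the unit vector along the baseline, that is $\sum_i v_i / \sqrt{n}$. The slide does not say whether the raw or the normalized history is projected, so the function (a hypothetical name) accepts either.

```python
import math


def projection_score(vec):
    """Length of the projection of `vec` onto the direction of (1, ..., 1)."""
    n = len(vec)
    return sum(vec) / math.sqrt(n)


balanced = projection_score([1.0, 1.0, 1.0, 1.0])   # cash from every query term
one_sided = projection_score([1.0, 0.0, 0.0, 0.0])  # cash from one term only
```

A node reached equally by all query terms scores higher than one reached by a single term.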
Euclidean Distance
The Euclidean metric computes the L2 distance between the normalized cash history vector of a candidate node and $\vec{1}_n$
Favours uniformity in the cash history distribution over the overall magnitude of the cash history
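A sketch of the Euclidean score, assuming the input is an already normalized history vector with components in [0, 1]. A smaller distance means a better candidate; the function name is hypothetical.

```python
import math


def euclidean_to_ones(vec):
    """L2 distance between `vec` and the all-ones vector of the same length."""
    return math.sqrt(sum((1.0 - v) ** 2 for v in vec))


near_uniform = euclidean_to_ones([1.0, 0.75])  # both colours well represented
skewed = euclidean_to_ones([1.0, 0.25])        # dominated by one colour
```

The near-uniform vector lies closer to the all-ones vector, so it ranks ahead of the skewed one.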
Cosine Similarity
Computes the cosine between a given node’s normalized cash history vector and $\vec{1}_n$
Another metric for factoring both uniformity in cash distribution and magnitude
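A sketch of the cosine score against the all-ones vector, which reduces to $\sum_i v_i / (\lVert v \rVert_2 \sqrt{n})$. The function name and the zero-vector fallback are assumptions.

```python
import math


def cosine_to_ones(vec):
    """Cosine of the angle between `vec` and the all-ones vector."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        return 0.0  # no cash history at all (fallback is an assumption)
    return sum(vec) / (norm * math.sqrt(len(vec)))


aligned = cosine_to_ones([1.0, 1.0, 1.0])          # lies on the baseline
single_colour = cosine_to_ones([1.0, 0.0, 0.0, 0.0])
```

The cosine by itself depends only on the direction of the vector, so rescaling a history vector leaves this score unchanged.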
Example results
Query Terms | Projection | Euclidean | Cosine
United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau | Currency, Bank, France | Currency, Bank, France
Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number | Mathematics, Mathematician, Euler | Mathematics, Mathematician, Probability distribution
MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay | University, College, Technology | University, College, Science
Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree | Plant, Tree, Species | Plant, Tree, Species
Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson | Mathematics, Probability, Expected Value | Mathematics, Probability, Statistics
User evaluation
Experimental Setup:
86 volunteer users were given a set of queries and asked to provide topical labels for these queries, ranked according to their perceived importance
66 volunteers answered 100 questions, while the rest answered 30 questions chosen at random from the 100
User responses were charted for consistency in results (chart shown below)
User evaluation: CLRW against tf-idf and OPIC
Comparison with Automatic Topic Labeling (ATL) algorithm [3]
Caveats: The comparison is with the Euclidean algorithm. ATL requires document contexts in which the topical anchor is present, unlike CLRW, which searches the co-occurrence graph built over a corpus
Future Work: Several open questions
Topical markers, semantic siblings
Co-occurrence semantics when coupled with concept hierarchies
Automatic detection of semantic relations based on co-occurrence
Automatic attribute identification
Thank You!
References
[1] A. Biletzki and A. Matar. Ludwig Wittgenstein (second revision). Stanford Encyclopedia of Philosophy, May 2009.
[2] W. Gerstner and W. M. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[3] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499, New York, NY, USA, 2007. ACM.
[4] M. R. Mutalikdesai and S. Srinivasa. Co-citations as endorsements of citations. Submitted for publication, 2009.
[5] S. Pinker. The Language Instinct. Harper Perennial Modern Classics, 2007.
[6] A. R. Rachakonda and S. Srinivasa. Finding the topical anchors of a context using lexical cooccurrence data. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2009.
[7] A. R. Rachakonda and S. Srinivasa. Vector-based ranking techniques for identifying the topical anchors of a context. In Proceedings of the 15th International Conference on Management of Data (COMAD), 2009.
[8] S. Reddy, S. Srinivasa, and M. R. Mutalikdesai. Measures of “ignorance” on the web. In Proceedings of the International Conference on Management of Data (COMAD), Dec 2006.