
Meme Extraction from Corpora of Scientific Literature using Citation Networks

Date post: 02-Jul-2015
Upload: tobias-kuhn
Description:
I present here my work on how to identify memes in the scientific literature by using the citation network.
Meme Extraction from Corpora of Scientific Literature using Citation Networks Tobias Kuhn http://www.tkuhn.ch @txkuhn ETH Zurich Colloquium Institute of Computational Linguistics University of Zurich 25 November 2014
Transcript
Page 1: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Meme Extraction from Corpora of Scientific Literature using Citation Networks

Tobias Kuhn

http://www.tkuhn.ch

@txkuhn

ETH Zurich

Colloquium
Institute of Computational Linguistics
University of Zurich

25 November 2014

Page 2: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Reference

Journal article on the content of this talk:

Tobias Kuhn, Matjaz Perc, and Dirk Helbing. Inheritance patterns in citation networks reveal scientific memes. Physical Review X, 4, 041036, 21 November 2014. https://journals.aps.org/prx/abstract/10.1103/PhysRevX.4.041036

Tobias Kuhn, ETH Zurich Meme Extraction from Corpora of Scientific Literature using Citation Networks 2 / 22

Page 3: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Meme Detection

I am presenting an approach to “meme detection”, which is related to a number of existing problems and approaches:

• Named-entity extraction

• Keyphrase extraction

• Topic modeling

• Terminology extraction


Page 4: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Context for NLP

Most NLP approaches focus on the analysis of the texts themselves:

• Grammar

• Morphology

• Text Structure

• Statistical Patterns

Some also take the contexts of the texts into account:

• Comparison to properties of entire corpus (e.g. tf–idf)

• Training on particular corpus/domain/speaker

• Citation graph of scientific publications


Page 5: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Citation Graph of Scientific Publications

Nodes: publications
Edges: citations (in gray)


Page 6: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Citation Graph of Scientific Publications

Nodes: publications
Edges: citations (in gray)

Legend:

• Natural/Agricultural Sciences (except Physical Sciences)

• Physical Sciences

• Engineering and Technology

• Medical and Health Sciences

• Social Sciences / Humanities



Page 8: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Citation Graph of Scientific Publications

Entire giant component (33 million nodes) of the citation graph of Thomson Reuters' Web of Science dataset.

Legend:

• Natural/Agricultural Sciences (except Physical Sciences)

• Physical Sciences

• Engineering and Technology

• Medical and Health Sciences

• Social Sciences / Humanities
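
As a rough illustration of what "giant component" means here, the sketch below extracts the largest connected component from a list of citation edges. It assumes that connectivity ignores citation direction (a weakly connected component); the slides do not say how the component was computed. The function name `giant_component` and the toy edge list are hypothetical.

```python
from collections import deque

def giant_component(citations):
    """Return the largest weakly connected component of a citation graph.

    `citations` is an iterable of (citing, cited) pairs. Treating edges as
    undirected is an assumption made for this sketch.
    """
    # Build an undirected adjacency map from the directed citation edges.
    neighbors = {}
    for citing, cited in citations:
        neighbors.setdefault(citing, set()).add(cited)
        neighbors.setdefault(cited, set()).add(citing)

    seen = set()
    largest = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects one component at a time.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    component.add(other)
                    queue.append(other)
        if len(component) > len(largest):
            largest = component
    return largest

# Toy data (made up): publications A, B, C are connected; D and E form a
# separate, smaller component.
toy_edges = [("B", "A"), ("C", "A"), ("C", "B"), ("E", "D")]
print(sorted(giant_component(toy_edges)))  # ['A', 'B', 'C']
```

For a graph with tens of millions of nodes a dedicated graph library would likely be a better fit than plain dictionaries; the sketch only shows the idea.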


Page 9: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Citation Graph: American Physical Society

Citation graph of the Physical Review journals (463k nodes).

Legend:

• A: Atomic, molecular, optical phys.

• B: Condensed matter, materials phys.

• C: Nuclear phys.

• D: Particles, fields, gravitation, cosmology

• E: Statistical, nonlinear, soft matter phys.

• other journals


Page 10: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Citation Graph: Memes

Specific phrases or “memes” localize to specific regions in the citation graph.

Legend:

• quantum

• fission

• graphene

• self-organized criticality

• traffic flow


Page 11: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Scientific Memes

“Meme” was coined by Richard Dawkins:

“Just as genes propagate themselves in the gene pool by leaping from body to body via sperm or eggs, so memes propagate themselves in the meme pool by leaping from brain to brain via a process which, in the broad sense, can be called imitation.” [Dawkins, The Selfish Gene]

Examples of memes:

• Melodies

• Recipes

• Cultural habits

• Words, grammar rules, text style

• Scientific concepts


Page 12: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Genes/Memes as Network Patterns!

Dawkins’ Definition of “Gene”:

“I am using the word gene to mean a genetic unit that is small enough to last for a number of generations and to be distributed around in many copies.” [Dawkins, The Selfish Gene]

Our Working Definition of “Scientific Meme”:

A scientific meme is a short unit of text in a publication that is replicated in citing publications and thereby distributed around in many copies.


Page 13: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Propagation Score

Propagation score P quantifies the degree to which a meme’s occurrence aligns with the citation graph:

$$P_m = \frac{\text{sticking factor}}{\text{sparking factor}} = \frac{d_{m \to m}}{d_{\to m}} \Big/ \frac{d_{m \to \not m}}{d_{\to \not m}}$$

To prevent infrequent phrases from getting a high propagation score by chance, we can add a small amount of controlled noise δ (we use δ = 3):

$$P_m = \frac{d_{m \to m}}{d_{\to m} + \delta} \Big/ \frac{d_{m \to \not m} + \delta}{d_{\to \not m} + \delta}$$
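
The propagation score can be sketched in a few lines of Python. The slide does not spell out the four counts, so the sketch assumes the following reading: d_{m→m} counts publications that carry the meme and cite at least one meme-carrying publication, d_{→m} counts all publications that cite at least one meme-carrying publication, and the two "not m" counts are the same quantities for publications that cite no meme-carrying publication (including those with no references at all). The function name, argument layout, and toy graph are hypothetical.

```python
def propagation_score(carriers, references, delta=3.0):
    """Propagation score P_m for one meme.

    carriers:   set of publication ids whose text contains the meme
    references: dict mapping each publication id to the ids it cites
    delta:      controlled noise (the slide uses delta = 3)
    """
    d_m_to_m = d_to_m = d_m_to_not_m = d_to_not_m = 0
    for pub, cited in references.items():
        has_meme = pub in carriers
        cites_meme = any(c in carriers for c in cited)
        if cites_meme:
            d_to_m += 1
            d_m_to_m += has_meme          # meme "sticks"
        else:
            d_to_not_m += 1
            d_m_to_not_m += has_meme      # meme "sparks"
    # With delta = 0 these divisions can fail on small data; that is the
    # problem the noise term is meant to soften.
    sticking = d_m_to_m / (d_to_m + delta)
    sparking = (d_m_to_not_m + delta) / (d_to_not_m + delta)
    return sticking / sparking

# Toy citation graph (made up for illustration).
carriers = {"A", "B", "C", "F"}
references = {
    "A": [], "B": ["A"], "C": ["A", "B"],
    "D": ["B"], "E": ["D"], "F": ["E"],
}
print(round(propagation_score(carriers, references, delta=0), 3))  # 1.0
print(round(propagation_score(carriers, references), 3))           # 0.4
```

On this toy graph the sticking and sparking factors are both 2/3 without noise, so the score is 1; with δ = 3 the small counts are damped and the score drops to 0.4.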


Page 14: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Frequency/Propagation Score for APS Data

[Figure: density of n-grams (color scale 10^0 to 10^5) by propagation score (axis from 10^−2 to 10^6) and relative frequency (axis from 10^−6 to 10^0) for the APS dataset, N = 1,372,365. Labeled memes: quantum, fission, graphene, self-organized criticality, traffic flow.]


Page 15: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Meme Score

Meme score M as the product of relative frequency f and propagation score P:

$$M_m = f_m P_m$$

Top 20 Memes for APS (Physics):

1. loop quantum cosmology +*
2. unparticle +*
3. sonoluminescence +*
4. MgB2 +
5. stochastic resonance +*
6. carbon nanotubes +*
7. NbSe3 +
8. black hole +*
9. nanotubes +
10. lattice Boltzmann +*
11. dark energy +*
12. Rashba
13. CuGeO3 +
14. strange nonchaotic
15. in NbSe3
16. spin Hall +
17. elliptic flow +*
18. quantum Hall +*
19. CeCoIn5 +
20. inflation +

+ annotators agreed that this is an interesting and important physics concept
* also found on the list of terms extracted from Wikipedia
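
A minimal sketch of how the meme score combines the two factors, assuming that the relative frequency f_m is the fraction of publications carrying the phrase. The phrase names, counts, and propagation scores below are invented for illustration and are not taken from the APS data; `meme_scores` and `top_memes` are hypothetical helper names.

```python
def meme_scores(carrier_counts, propagation, n_publications):
    """Meme score M_m = f_m * P_m for every phrase.

    carrier_counts: phrase -> number of publications containing it
    propagation:    phrase -> propagation score P_m
    n_publications: total number of publications in the corpus
    """
    return {
        phrase: (carrier_counts[phrase] / n_publications) * propagation[phrase]
        for phrase in carrier_counts
    }

def top_memes(scores, k):
    """Return the k phrases with the highest meme score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Invented numbers: a very common word that does not follow citations,
# a moderately rare phrase that does, and a one-off phrase.
counts = {"common word": 900, "localized term": 20, "rare phrase": 1}
p_scores = {"common word": 1.0, "localized term": 150.0, "rare phrase": 2.0}

scores = meme_scores(counts, p_scores, n_publications=1000)
print(top_memes(scores, 3))
# ['localized term', 'common word', 'rare phrase']
```

The product rewards phrases that are both reasonably frequent and strongly aligned with the citation graph, which is why the common word ends up below the localized term despite its much higher frequency.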


Page 16: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Properties of the Meme Score

The meme score has a number of nice properties:

• Can be calculated efficiently and exhaustively even on very large datasets

• No upper limit on the length of n-grams

• No dependence on external linguistic or ontological knowledge

• No stop-word lists or other kinds of arbitrary filters or thresholds


Page 17: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Manual Annotation

• Two annotators (A1, A2): PhD students with physics degree

• Annotation with respect to (1) physics concept or not and (2) linguistic category

• Randomly extracted phrases for comparison

[Figure: annotation results of A1 and A2 for terms selected by meme score, at random, and by weighted random (axis: terms, 30 to 150). Categories: physics concept / not a physics concept; noun phrase, verb, adjective or adverb, other.]


Page 18: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Comparison to Alternative Metrics

[Figure: A (area under curve, axis from 0 to 0.5) for the metrics meme score, frequency, max. absolute change over time, max. relative change over time, max. absolute difference across journals, and max. relative difference across journals.]

[Figure: percentage of Wikipedia terms (0 to 100) among the top x terms by meme score (x from 10^1 to 10^3). Annotation: 40% of top 50 terms are found on Wikipedia list.]
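
The "percentage of Wikipedia terms among the top x" measure can be sketched as follows. The example reuses the first ten entries of the APS top-20 list from the Meme Score slide, where an asterisk marks terms also found on the Wikipedia list; the function name is hypothetical, and the way the area under the curve A is derived from these percentages is not shown on the slide and is not attempted here.

```python
def percent_on_reference_list(ranked_terms, reference_terms, x):
    """Percentage of the top-x ranked terms that appear in a reference list."""
    top = ranked_terms[:x]
    hits = sum(1 for term in top if term in reference_terms)
    return 100.0 * hits / len(top)

# First ten APS memes by meme score, as listed on the Meme Score slide.
ranked = [
    "loop quantum cosmology", "unparticle", "sonoluminescence", "MgB2",
    "stochastic resonance", "carbon nanotubes", "NbSe3", "black hole",
    "nanotubes", "lattice Boltzmann",
]
# Those of the ten that are marked with * (found on the Wikipedia list).
on_wikipedia = {
    "loop quantum cosmology", "unparticle", "sonoluminescence",
    "stochastic resonance", "carbon nanotubes", "black hole",
    "lattice Boltzmann",
}

print(percent_on_reference_list(ranked, on_wikipedia, 5))   # 80.0
print(percent_on_reference_list(ranked, on_wikipedia, 10))  # 70.0
```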


Page 19: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Evolution over Time: Exemplary Memes

[Figure: meme score (δ = 1, axis from 0 to 14) versus publication count (0.5 to 4.5 × 10^5, with year marks from 1940 to 2008) for the memes quantum, fission, graphene, self-organized criticality, and traffic flow.]


Page 20: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Evolution over Time

[Figure: meme score (axis from 0 to 12) versus publication count (0.5 to 4.5 × 10^5, with year marks from 1940 to 2008). Labeled memes: graphene, entanglement, MgB2, nanotubes, carbon nanotubes, quark, neutrino, Bose-Einstein, quantum Hall, black, C60, Hubbard model, quantum wells, graphite, reactions, photoemission, black hole, tricritical, Kondo, superconducting, fission, MeV, diffuse scattering.]


Page 21: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Conclusions

The citation graph is a very powerful resource to detect memes.

Combined with other existing approaches, this seems to be a promising tool for NLP on scientific publications.

Could be applied to other types of texts that have a certain kind of citation structure (legal texts?).

Allows for studying memes in an exhaustive manner.


Page 22: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Thank you for your Attention!

Questions?


Page 23: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Randomized Network

[Figure: density of n-grams (color scale 10^0 to 10^5) by propagation score (axis from 10^−2 to 10^6) and relative frequency (axis from 10^−6 to 10^0) for the randomized (time preserving) APS citation network, N = 89,356.]


Page 24: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Meme Score Calculation

1 Collect all phrases that stick at least once (not counting “free-riding” on larger memes)

2 Calculate sticking and sparking factors for all collected phrases:

$$M_m = f_m P_m \quad \text{with} \quad P_m = \frac{\text{sticking factor}}{\text{sparking factor}} = \frac{d_{m \to m}}{d_{\to m} + \delta} \Big/ \frac{d_{m \to \not m} + \delta}{d_{\to \not m} + \delta}$$

Example

Citing title:
covariant effective action for loop quantum cosmology from order reduction

Cited titles:
– quantum nature of the big bang
– absence of a singularity in loop quantum cosmology
– large scale effective theory for cosmological bounces

Sticking phrases: loop quantum cosmology, quantum, effective, for
Sparking phrases: covariant, covariant effective action, order reduction, ...
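
The example above can be reproduced with a short sketch. The slide does not define "free-riding" precisely, so the rule used here is an assumption: a phrase is treated as free-riding when it is part of a longer shared phrase and every cited title that contains it also contains that longer phrase. Phrases are compared as whole-word n-grams; all function names are hypothetical.

```python
def ngrams(title):
    """All contiguous word sequences of a title, as tuples of tokens."""
    tokens = title.split()
    return {
        tuple(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    }

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def sticking_and_sparking(citing_title, cited_titles):
    citing = ngrams(citing_title)
    cited = [ngrams(t) for t in cited_titles]
    # For each phrase of the citing title: which cited titles carry it?
    carriers = {g: {k for k, grams in enumerate(cited) if g in grams}
                for g in citing}
    shared = {g for g, c in carriers.items() if c}
    sparking = citing - shared
    sticking = set()
    for g in shared:
        # Assumed free-riding rule: g only ever appears in cited titles
        # together with a longer shared phrase that contains it.
        free_riding = any(
            len(big) > len(g) and contains(big, g)
            and carriers[g] <= carriers[big]
            for big in shared
        )
        if not free_riding:
            sticking.add(g)
    return ({" ".join(g) for g in sticking},
            {" ".join(g) for g in sparking})

citing_title = ("covariant effective action for loop quantum cosmology "
                "from order reduction")
cited_titles = [
    "quantum nature of the big bang",
    "absence of a singularity in loop quantum cosmology",
    "large scale effective theory for cosmological bounces",
]
sticking, sparking = sticking_and_sparking(citing_title, cited_titles)
print(sorted(sticking))
# ['effective', 'for', 'loop quantum cosmology', 'quantum']
```

Under this rule "loop" and "cosmology" are dropped because they only occur in the cited title that also contains "loop quantum cosmology", while "quantum" is kept because it also occurs on its own in the first cited title.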


Page 25: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Top Meme Scores for Web of Science Data

1. MgB2
2. lattice Boltzmann
3. graphene
4. on chalcogenolates
5. Ti3SiC2
6. harmony search
7. seasonal climate summary southern hemisphere
8. empirical likelihood
9. proxy re-encryption
10. spiking neural P systems
11. loop quantum cosmology
12. zero-divisor
13. BiFeO3
14. Neospora
15. Papuloerythroderma
16. Neospora caninum
17. metal dusting
18. porcine circovirus
19. cone metric
20. ranked set


Page 26: Meme Extraction from Corpora of Scientific Literature using Citation Networks

Top Meme Scores for PubMed Central Data

1. Buruli ulcer
2. G-quadruplex
3. miRNAs
4. chronic cerebrospinal venous insufficiency
5. cerebrospinal venous
6. Mycobacterium ulcerans
7. enterovirus 71
8. G-quadruplexes
9. CCSVI
10. malaria
11. Nipah virus
12. miRNA
13. microRNAs
14. hepatitis E virus
15. the 45 and Up Study
16. chronic cerebrospinal venous insufficiency (CCSVI)
17. EV71
18. bluetongue
19. Schmallenberg virus
20. Nipah


