
SCIENCE MAPS

Date post: 08-Jan-2016
Page 1: SCIENCE MAPS

SCIENCE MAPS

SI767 – W10 – Matthew P. Simmons

Page 2: SCIENCE MAPS

Overview

• What are Science Maps?
  o Various definitions.
  o Usage and utility.
• What techniques are used?
  o Types of techniques.
  o Overview of certain techniques.
• What has been done on the topic?
  o Review of papers.
  o The small world of these readings.

Page 3: SCIENCE MAPS

What are Science Maps?

...and why do we care?

Page 4: SCIENCE MAPS

In the reading...

Science Maps are ...
• Topic models -
  o Detecting the finite number of themes/topics that characterize the content of a knowledge domain.
• Scientometric analysis of the provenance of ideas -
  o Tracking the memetic flow in scientific literature by analyzing the pattern of citations and collaborations.
• Models of the evolution of bibliometric networks -
  o Finding the parameters that generate networks with similar structure to citation and coauthorship networks, and determining why those parameters matter.

Page 5: SCIENCE MAPS

Various techniques/approaches; Common theme
Find the hidden ontologies that organize data.
Organize scientific data by:
• Topic -
  o "Which articles are about the same thing as this one?"
• Influence -
  o "What are the five most important papers in topic modeling?"
• Provenance -
  o "Where are Bayesian inference theories coming from, and who is using them?"
• "Hotness" of a field -
  o "Where are the research dollars in NLP today?"

Page 6: SCIENCE MAPS

Usage and Utility

• Enhance the ability to navigate data.
• Identify potential collaborators.
• Determine the impact of an author or paper.
• Identify the problems that need solving (and where the money is...)
• Create tools to allow our allocation of attention to efficiently scale with the massive increase of data.
• "Revealing implicit knowledge that is presently known only to domain experts..." (Shiffrin et al. 2004)

Page 7: SCIENCE MAPS

What Techniques are Used?

 Hint: Lots!

Page 8: SCIENCE MAPS

Answer: Lots!

• Support Vector Machines
• Clustering
• Latent Dirichlet Allocation
• Latent Semantic Analysis
• Mixture Models

• Network modeling
• Network analysis
• Network visualization
• Bibliometric analysis

...and that's just what's mentioned in Shiffrin et al.!

Page 9: SCIENCE MAPS

In the readings...

• Markov Chain Monte Carlo – Griffiths et al.
• Latent Dirichlet Allocation – Blei et al. & Griffiths et al.
• Pathfinder [Networks|Scaling] – Chen
• Bayesian Networks – Blei et al. & Griffiths et al.
• Bibliometric networks – Garfield, Börner et al. & Chen

...and more, but there is an easy way to group them.

Page 10: SCIENCE MAPS

Two main categories

Statistical models/methods
• Markov Chain
• Monte Carlo method
• Dirichlet Distribution
• Bayesian Inference
• Latent Dirichlet Allocation

Network models/methods
• Degree distributions
• Centrality measures
• Scale free networks
• Small world networks
• Clustering coefficients
• Pathfinder scaling

Page 11: SCIENCE MAPS

Everyone got all that?

Stats Stuff

Page 12: SCIENCE MAPS

Just kidding...

Let's start with Markov Chains

Page 13: SCIENCE MAPS

Markov Chains

• A system that occupies one of several possible states and moves between them in discrete steps.

• The moves are governed by transition probabilities that satisfy the Markov property.

• The Markov property states that the state of the system at time n+1 depends only on its state at time n, not on its state at any earlier time. Hence the immediately previous state is the only factor that matters in determining the next state of the system.

• Example: A random walk on the number line with an equal probability of moving +1 or -1 at each step.

Largely from: http://en.wikipedia.org/wiki/Markov_chain
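
The random-walk example can be sketched in a few lines of Python. This is a minimal illustration, not material from the original slides; the function name `random_walk` and its parameters are made up for the example.

```python
import random

def random_walk(steps, start=0, seed=None):
    """Simulate a symmetric random walk on the integers (hypothetical helper)."""
    rng = random.Random(seed)
    position = start
    path = [position]
    for _ in range(steps):
        # Markov property: the next position depends only on the current one,
        # never on how the walk arrived there.
        position += rng.choice((+1, -1))
        path.append(position)
    return path

path = random_walk(steps=10, seed=0)
print(path)
```

Every consecutive pair of positions in `path` differs by exactly 1, and nothing earlier than the current position is consulted when choosing the next step.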

Page 14: SCIENCE MAPS

Monte Carlo method

• A process that utilizes repeated random sampling to derive an approximated result.

General Process:
– Define a domain of possible inputs.
– Generate inputs randomly from the domain using a certain specified probability distribution.
– Perform a deterministic computation using the inputs.
– Aggregate the results of the individual computations into the final result.

From: http://en.wikipedia.org/wiki/Monte_carlo_method
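
The four steps of the general process can be seen in the classic estimate of pi, sketched here as an illustration (the example and the name `estimate_pi` are not from the slides): the domain is the unit square, inputs are drawn uniformly, the deterministic check is whether a point falls inside the quarter circle, and the aggregate is the fraction of hits.

```python
import random

def estimate_pi(n_samples, seed=None):
    """Monte Carlo estimate of pi (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        # Steps 1 and 2: draw a point uniformly from the unit square.
        x, y = rng.random(), rng.random()
        # Step 3: deterministic computation on the sampled input.
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate. The quarter circle covers pi/4 of the square.
    return 4.0 * inside / n_samples

pi_hat = estimate_pi(100_000, seed=1)
print(pi_hat)
```

The estimate tightens slowly: the error shrinks roughly with the square root of the number of samples.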

Page 15: SCIENCE MAPS

"By our powers combined..."

Markov Chain Monte Carlo Method

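• A Monte Carlo method whose random samples are generated by running a Markov chain.
• Gibbs sampling is one widely used MCMC algorithm (it is the one Griffiths et al. use to fit LDA).

From:
Dirichlet Processes, Chinese Restaurant Processes and All That, Michael I. Jordan 2005 http://www.cs.berkeley.edu/~jordan
David MacKay, Information theory, inference, and learning algorithms (Cambridge, UK; New York: Cambridge University Press, 2003).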
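
As a rough sketch of how a Gibbs sampler combines the two ideas, the code below samples from a two-dimensional standard normal distribution with correlation `rho` by alternately redrawing each coordinate from its conditional distribution given the other. The target distribution, the function name, and the burn-in length are assumptions chosen for illustration; they are not from the slides or the readings.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=None):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # std. dev. of x given y (and y given x)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        # Each update depends only on the current state: a Markov chain.
        x = rng.gauss(rho * y, cond_sd)   # draw x | y
        y = rng.gauss(rho * x, cond_sd)   # draw y | x
        if step >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20_000, seed=42)
mean_x = sum(x for x, _ in samples) / len(samples)
mean_xy = sum(x * y for x, y in samples) / len(samples)
print(mean_x, mean_xy)
```

Averaging over the chain is the Monte Carlo part: `mean_x` should land near 0 and `mean_xy` near `rho`.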

Page 16: SCIENCE MAPS

Bayesian Paradigm

From: Structured Bayesian Nonparametric Models with Variational Inference, ACL Tutorial, Prague, Czech Republic, Percy Liang and Dan Klein
http://www.cs.berkeley.edu/~pliang/papers/tutorial-acl2007.pdf

Page 17: SCIENCE MAPS

Bayesian Inference 

From: Robert Cowell, Introduction to inference for Bayesian networks, in Learning in graphical models, ed. Michael Jordan (MIT Press, 1999), 9-27.

Think of A as the topics in a document and B as the words observed.

The goal is to infer the most probable topic distribution given the observed words.
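
A small worked example may help. The sketch below applies Bayes' rule, P(A|B) = P(B|A) P(A) / P(B), under a strong simplifying assumption that each document has exactly one topic (full LDA allows a mixture). The topic names, word probabilities, and the function `topic_posterior` are all invented for illustration.

```python
def topic_posterior(prior, word_probs, observed_words):
    """Posterior over a single document topic A given observed words B."""
    unnormalized = {}
    for topic, p_topic in prior.items():
        likelihood = 1.0
        for word in observed_words:
            likelihood *= word_probs[topic][word]   # P(B | A), words independent
        unnormalized[topic] = p_topic * likelihood  # P(B | A) * P(A)
    evidence = sum(unnormalized.values())           # P(B)
    return {t: v / evidence for t, v in unnormalized.items()}

# Hypothetical prior and per-topic word distributions.
prior = {"genetics": 0.5, "networks": 0.5}
word_probs = {
    "genetics": {"gene": 0.6, "graph": 0.1, "model": 0.3},
    "networks": {"gene": 0.1, "graph": 0.6, "model": 0.3},
}
posterior = topic_posterior(prior, word_probs, ["graph", "model", "graph"])
print(posterior)  # "networks" gets about 0.973, "genetics" about 0.027
```

Seeing "graph" twice shifts nearly all of the probability onto the "networks" topic, even though both topics started out equally likely.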

Page 18: SCIENCE MAPS

Latent Dirichlet Allocation

• A generative document model.
• Each document is composed of a number of words drawn from the topics that make up the document.
• Each document has a probability distribution over topics, and each topic has a probability distribution over words.

   ...pictures help here...
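
The generative story can also be sketched in code. This is only the forward (document-generating) direction under assumed inputs: the vocabulary, the two topic-word distributions, and the symmetric Dirichlet parameter `alpha` are made up, and a Dirichlet draw is obtained by normalizing independent gamma draws.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document the way LDA assumes documents arise."""
    n_topics = len(topic_word)
    # Per-document distribution over topics.
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]   # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]     # pick a word from it
        words.append(w)
    return theta, words

# Hypothetical vocabulary and per-topic word distributions.
vocab = ["gene", "dna", "graph", "node"]
topic_word = [
    [0.5, 0.4, 0.05, 0.05],  # a "genetics"-like topic
    [0.05, 0.05, 0.5, 0.4],  # a "networks"-like topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=0.5, n_words=12, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, infer `theta` for each document and the topic-word distributions.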

Page 19: SCIENCE MAPS

Latent Dirichlet Allocation - Cont.

From: David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003): 993-1022.

Page 20: SCIENCE MAPS

Dirichlet Distribution

From: Structured Bayesian Nonparametric Models with Variational Inference, ACL Tutorial, Prague, Czech Republic, Percy Liang and Dan Klein
http://www.cs.berkeley.edu/~pliang/papers/tutorial-acl2007.pdf

Page 21: SCIENCE MAPS

Network Analysis/Methods

From: Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A., et al. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3), e4803. doi: 10.1371/journal.pone.0004803

Page 22: SCIENCE MAPS

Centrality

• Betweenness:
  o Bridging
  o # of shortest paths through a node to other nodes.
• Closeness
  o Avg distance to other nodes.
• Degree
  o Number of edges of one type or another.
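
The three measures can be computed on a toy graph with nothing but the standard library. This sketch assumes an undirected, unweighted, connected graph stored as an adjacency dictionary; the five-node graph is invented, and betweenness is computed with Brandes' accumulation scheme and left unnormalized.

```python
from collections import deque

def degree(graph):
    """Number of edges touching each node."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

def average_distance(graph):
    """Average shortest-path distance from each node to all others."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        others = [d for node, d in dist.items() if node != source]
        result[source] = sum(others) / len(others)
    return result

def betweenness(graph):
    """Shortest paths through each node, counted once per unordered pair."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        dist = {s: 0}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in graph}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):       # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # undirected: halve

# Hypothetical graph: A - B - C, with D and E both attached to C.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D", "E"], "D": ["C"], "E": ["C"]}
print(degree(graph))            # C has the highest degree (3)
print(average_distance(graph))  # C is closest on average (1.25)
print(betweenness(graph))       # C bridges 5 pairs, B bridges 3
```

Note that a lower average distance means a more central node; closeness is often reported as the reciprocal of this value.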

Page 23: SCIENCE MAPS

Pathfinder Scaling

• Network scaling (edge reduction) method.
• Generates a minimum spanning tree plus a parameter-tunable number of redundant edges.
• Can use different metrics to determine which edges to prune, such as the Euclidean distance or edge weight.
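
A minimal sketch of the pruning idea follows. It covers only the most aggressive setting described in the Pathfinder literature (parameters usually written r = infinity and q = n - 1, names that do not appear in the slides), where the length of a path is its heaviest edge and an edge survives only if no alternative path beats it. The weight matrix is invented for illustration.

```python
import math

INF = math.inf

def pathfinder_prune(weights):
    """Keep edge (i, j) only if no other path has a smaller maximum edge weight.

    `weights` is a symmetric matrix of distances; INF marks a missing edge.
    """
    n = len(weights)
    best = [row[:] for row in weights]
    # Floyd-Warshall variant: path "length" is the largest edge on the path.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(best[i][k], best[k][j])
                if via_k < best[i][j]:
                    best[i][j] = via_k
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if weights[i][j] != INF and weights[i][j] <= best[i][j]
    ]

# Hypothetical 4-node network of pairwise distances.
weights = [
    [0, 1, 3, 5],
    [1, 0, 2, INF],
    [3, 2, 0, 1],
    [5, INF, 1, 0],
]
print(pathfinder_prune(weights))  # [(0, 1), (1, 2), (2, 3)]
```

Edges (0, 2) and (0, 3) are dropped because the route through node 1 never uses an edge heavier than 2. Less aggressive parameter settings would retain more of the redundant edges.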

Page 24: SCIENCE MAPS

What's been done on the topic?

What did we learn today, class?

Page 25: SCIENCE MAPS

R. M. Shiffrin and K. Börner, Mapping knowledge domains, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5183-5185.

• Overview of the field and the PNAS articles.
• Important takeaway: There is a lot of research potential in this area, and it benefits from an interdisciplinary analysis involving a variety of techniques.

Page 26: SCIENCE MAPS

T. L. Griffiths and Mark Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5228-5235.

• LDA
• Optimal Topics ~300
• LDA over PNAS abstracts 1991-2001
• ~3 million words total - ~20k terms in vocabulary

Contributions:
• Hot and Cold topics
• Topical clustering (heatmap)
• Demonstrate that content analysis can reveal topics

Page 27: SCIENCE MAPS

The word distribution within 10 topics.

30 randomly generated "documents"generated from the above 10 topics.

Using LDA to derive the original topics from the observed documents.

Page 28: SCIENCE MAPS

Griffiths et al. cont...

Cold Topics.        Hot Topics!

Page 29: SCIENCE MAPS

David M. Blei and John D. Lafferty, A correlated topic model of Science, The Annals of Applied Statistics 1, no. 1 (2007): 17-35

• Correlated Topic Models
  o Evolution of LDA
  o Introduces the notion that the proportions of the topics comprising a document are not necessarily independent.
  o Replaces the use of the Dirichlet distribution with a logistic normal distribution with a covariance structure as a parameter.
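
The effect of the covariance parameter can be sketched by drawing topic proportions as a Gaussian vector pushed through a softmax. This is a simplified illustration rather than the paper's exact parameterization; the mean vector, covariance matrix, and helper names are assumptions.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular factor L with L * L^T = matrix (positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def sample_topic_proportions(mu, sigma, rng):
    """Draw eta ~ Normal(mu, sigma), then map it onto the simplex."""
    lower = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    eta = [mu[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
           for i in range(len(mu))]
    top = max(eta)                       # subtract the max for stability
    exp_eta = [math.exp(e - top) for e in eta]
    total = sum(exp_eta)
    return [e / total for e in exp_eta]

# Hypothetical parameters: topics 0 and 1 tend to rise and fall together.
mu = [0.0, 0.0, 0.0]
sigma = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
rng = random.Random(3)
theta = sample_topic_proportions(mu, sigma, rng)
print(theta)
```

A Dirichlet draw has no comparable knob: it cannot express that two topics tend to appear together.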

Page 30: SCIENCE MAPS

Blei et al. cont...

Contributions:
• CTM outperforms LDA when the number of topics is larger.
• CTM also predicts words more accurately with less training data than LDA.
• Both of these are credited to the effect of topic correlation on the distribution of topics in a document.
• A science map!
  o The covariance matrix is used to create a graph where the topics are vertices and the edges represent some level of covariance.
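
One simple way to turn a covariance matrix into such a graph is to threshold it, as sketched below. The thresholding rule, the topic labels, and the numbers are assumptions made for illustration; the paper's own edge-selection procedure may differ.

```python
def covariance_graph(topics, cov, threshold):
    """Return (topic_a, topic_b, covariance) edges with |covariance| >= threshold."""
    edges = []
    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            if abs(cov[i][j]) >= threshold:
                edges.append((topics[i], topics[j], cov[i][j]))
    return edges

# Hypothetical topic labels and covariance values.
topics = ["genetics", "evolution", "networks"]
cov = [
    [1.0, 0.7, 0.1],
    [0.7, 1.0, -0.5],
    [0.1, -0.5, 1.0],
]
edges = covariance_graph(topics, cov, threshold=0.4)
print(edges)  # [('genetics', 'evolution', 0.7), ('evolution', 'networks', -0.5)]
```

The resulting edge list can be handed to any graph layout tool to draw the map.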

Page 31: SCIENCE MAPS

Blei et al.

Page 32: SCIENCE MAPS

Katy Börner and Jeegar T. Maru and Robert L. Goldstone, The simultaneous evolution of author and paper networks, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5266-5273.

• Bibliometric network encompassing coauthorship and citation.
• Built to model the PNAS collaboration/citation network.
• 2 node types:
  o Authors
  o Papers
• Several edge types:
  o directional information flow between paper:author and paper:paper
  o author and coauthor

Page 33: SCIENCE MAPS

Börner cont...
3 main parameters in the model:
1. Topics - i.e. scientific specializations
2. Aging - Meant to capture the bias to cite recent material.
3. Recursive Linking - The propensity to read the papers cited by the papers you have read.

Iterative simulation that modeled the addition of new authors, the removal of old ones, coauthorship, and the propensity of authors to publish within their topic domain.
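
To give a feel for how an aging bias interacts with preferential attachment, here is a toy citation simulation. It is not the model from the paper: it has no authors, topics, or recursive linking, and the exponential decay with age and the parameter names (`aging_scale`, `refs_per_paper`) are assumptions chosen to keep the sketch short.

```python
import math
import random

def simulate_citations(n_papers, refs_per_paper, aging_scale, seed=None):
    """Grow a citation network one paper at a time.

    A new paper cites older papers with probability proportional to
    (citations so far + 1) * exp(-age / aging_scale).
    """
    rng = random.Random(seed)
    citations = [0]       # citation count per paper, indexed by arrival order
    references = [[]]     # papers cited by each paper
    for new in range(1, n_papers):
        weights = [
            (citations[old] + 1) * math.exp(-(new - old) / aging_scale)
            for old in range(new)
        ]
        k = min(refs_per_paper, new)
        chosen = set()
        while len(chosen) < k:          # resample until k distinct papers
            chosen.add(rng.choices(range(new), weights=weights)[0])
        for old in chosen:
            citations[old] += 1
        citations.append(0)
        references.append(sorted(chosen))
    return citations, references

citations, references = simulate_citations(
    n_papers=200, refs_per_paper=3, aging_scale=10.0, seed=7
)
print(max(citations), citations[:10])
```

Shrinking `aging_scale` makes papers stop attracting citations sooner, which limits how far the "rich get richer" effect can run.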

Page 34: SCIENCE MAPS

Looks like a decent fit...

Page 35: SCIENCE MAPS

Börner cont...
Contributions:
• Models the constraint of aging on preferential attachment in scale-free network formation.
• Models the "splintering" of science caused by specialization.

Page 36: SCIENCE MAPS

Chaomei Chen, CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature, Journal of the American Society for Information Science and Technology 57, no. 3 (2006): 359-377.

• Co-citation network.
• Clusters labeled based on Kleinberg's burst detection algorithm (Kleinberg 2002).

9 steps:
– Identify knowledge domain, e.g. "mass extinction".
– Automated data collection. Uses PubMed and Web of Science.
– Find burst terms. CiteSpace II scrapes [1-4]-grams.
– Time slicing. Generate time series views of the network.
– Choose thresholds (intellectual bases & research fronts).
– Graph scaling. Reduce edges to improve visual clarity without sacrificing critical visual data.
– Layout. Typically a force directed layout to emphasize clustering.
– Visual inspection. Tweak labels and display of metadata.
– Verify pivot points.
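
The co-citation and time-slicing ideas can be sketched together: two references are co-cited whenever one paper cites both, and the counts are kept separately for each time slice. The records and the function name below are invented for illustration and do not reflect CiteSpace II's actual data format or code.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cocitation_by_slice(papers, slice_length):
    """Count co-cited reference pairs per time slice.

    `papers` is a list of (publication_year, [cited reference ids]).
    Returns {slice_start_year: Counter({(ref_a, ref_b): count})}.
    """
    slices = defaultdict(Counter)
    for year, refs in papers:
        slice_start = year - (year % slice_length)
        # Every unordered pair in one reference list is a co-citation.
        for pair in combinations(sorted(set(refs)), 2):
            slices[slice_start][pair] += 1
    return dict(slices)

# Hypothetical citing papers.
papers = [
    (2000, ["A", "B"]),
    (2001, ["A", "B", "C"]),
    (2002, ["B", "C"]),
    (2003, ["A", "C"]),
]
slices = cocitation_by_slice(papers, slice_length=2)
print(slices[2000])  # ('A', 'B') is co-cited twice in the 2000-2001 slice
print(slices[2002])
```

Each slice's counter is a weighted edge list, so thresholding and graph scaling can then be applied slice by slice.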

Page 37: SCIENCE MAPS

Chen cont...

Contributions:
• Detecting research fronts, intellectual bases, and pivots.
• Detecting trends in scientific research.
• Visualization of knowledge domain.

Page 38: SCIENCE MAPS

Tree ring view of citations over time

Overview of Citespace II

Page 39: SCIENCE MAPS

Eugene Garfield, Historiographic Mapping of Knowledge Domains Literature, Journal of Information Science 30, no. 2 (April 1, 2004): 119-145.

• Garfield is a co-founder of scientometrics.
• Bibliometric analysis and link tracking reveal the impact of papers on a field.
• Concept of local citation score and global citation score.
• Adding group and time slicing to learn more about the effect of those slices on bibliometric records.
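
A local citation score can be sketched as a count of citations that come from inside the collection being studied. The paper identifiers and the function name here are hypothetical; a score that also counts citations from outside the collection would need data from an external citation index, which this sketch does not model.

```python
def local_citation_scores(collection):
    """Count, for each paper, how many other papers in the collection cite it.

    `collection` maps a paper id to the list of ids it cites.
    References to papers outside the collection are ignored.
    """
    scores = {paper: 0 for paper in collection}
    for paper, refs in collection.items():
        for ref in set(refs):
            if ref in scores and ref != paper:
                scores[ref] += 1
    return scores

# Hypothetical collection; "X9" lies outside it.
collection = {
    "P1": [],
    "P2": ["P1", "X9"],
    "P3": ["P1", "P2"],
    "P4": ["P2", "P3", "X9"],
}
print(local_citation_scores(collection))  # {'P1': 2, 'P2': 2, 'P3': 1, 'P4': 0}
```

Restricting `collection` to papers from one time window before scoring gives the time-sliced view mentioned above.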

Page 40: SCIENCE MAPS

Garfield cont...

Contributions:
• Bibliometrics...
• Finding out which papers were important at a certain time vs. from a current perspective.

Page 41: SCIENCE MAPS

Thank You

Questions or Comments?

Page 42: SCIENCE MAPS

Bonus Bibliography Slide!
• Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17-35. doi: 10.1214/07-AOAS114
• Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993-1022.
• Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A., et al. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3), e4803. doi: 10.1371/journal.pone.0004803
• Börner, K. (2004). The simultaneous evolution of author and paper networks. Proceedings of the National Academy of Sciences, 101(suppl_1), 5266-5273. doi: 10.1073/pnas.0307625100
• Börner, K. (2007). Making sense of mankind’s scholarly knowledge and expertise: collecting, interlinking, and organizing what we know and different approaches to mapping (network) science. Environment and Planning B: Planning and Design, 34(5), 808-825. doi: 10.1068/b3302t
• Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359-377. doi: 10.1002/asi.20317
• Cowell, R. (1999). Introduction to inference for Bayesian networks. In M. Jordan (Ed.), Learning in graphical models (pp. 9-27). MIT Press.
• Garfield, E. (2004). Historiographic Mapping of Knowledge Domains Literature. Journal of Information Science, 30(2), 119-145. doi: 10.1177/0165551504042802
• Griffiths, T. L. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1), 5228-5235. doi: 10.1073/pnas.0307752101
• Hall, D., Jurafsky, D., & Manning, C. D. (2008). Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 363-371). Honolulu, Hawaii: Association for Computational Linguistics. Retrieved from http://portal.acm.org.proxy.lib.umich.edu/citation.cfm?id=1613715.1613763
• Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569-16572. doi: 10.1073/pnas.0507655102
• Janssens, F., Glänzel, W., & Moor, B. D. (2007). Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 360-369). San Jose, California, USA: ACM. Retrieved from http://portal.acm.org.proxy.lib.umich.edu/citation.cfm?id=1281192.1281233

Page 43: SCIENCE MAPS

Bibliography continued...

• Jordan, M. I. (1999). Learning in graphical models. MIT Press.
• Leicht, E. A., Clarkson, G., Shedden, K., & Newman, M. E. J. (2007). Large-scale structure of time evolving citation networks. 0706.0015. doi: 10.1140/epjb/e2007-00271-7
• MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge, UK; New York: Cambridge University Press.
• Shibata, N., Kajikawa, Y., Takeda, Y., & Matsushima, K. (2009). Comparative study on methods of detecting research fronts using different types of citation. J. Am. Soc. Inf. Sci. Technol., 60(3), 571-580.
• Shiffrin, R. M. (2004). Mapping knowledge domains. Proceedings of the National Academy of Sciences, 101(suppl_1), 5183-5185. doi: 10.1073/pnas.0307852100
• Torres-Moreno, J., St-Onge, P., Gagnon, M., El-Bèze, M., & Bellot, P. (2009, May 1). Automatic Summarization System coupled with a Question-Answering System (QAAS). ArXiv e-prints. Retrieved January 11, 2010, from http://adsabs.harvard.edu/abs/2009arXiv0905.2990T
• Zhu, D., & Porter, A. L. (2002). Automated extraction and visualization of information for technological intelligence and forecasting. Technological Forecasting and Social Change, 69, 495-506. doi: 10.1016/S0040-1625(01)00157-3

