SCIENCE MAPS


SI767 – W10 – Matthew P. Simmons

Overview

• What are Science Maps?
o Various definitions.
o Usage and utility.

• What techniques are used?
o Types of techniques.
o Overview of certain techniques.

• What has been done on the topic?
o Review of papers.
o The small world of these readings.

What are Science Maps?

...and why do we care?

In the reading...

Science Maps are ...

• Topic models -
o Detecting the finite number of themes/topics that characterize the content of a knowledge domain.

• Scientometric analysis of the provenance of ideas -
o Tracking the memetic flow in scientific literature by analyzing the pattern of citations and collaborations.

• Models of the evolution of bibliometric networks -
o Finding the parameters that generate networks with similar structure to citation and coauthorship networks, and determining why those parameters matter.

Various techniques/approaches; common theme:
Find the hidden ontologies that organize data.

Organize scientific data by:

• Topic -
o "Which articles are about the same thing as this one?"

• Influence -
o "What are the five most important papers in topic modeling?"

• Provenance -
o "Where are Bayesian inference theories coming from, and who is using them?"

• "Hotness" of a field -
o "Where are the research dollars in NLP today?"

Usage and Utility

• Enhance the ability to navigate data.
• Identify potential collaborators.
• Determine the impact of an author or paper.
• Identify the problems that need solving (and where the money is...)
• Create tools to allow our allocation of attention to efficiently scale with the massive increase of data.
• "Revealing implicit knowledge that is presently known only to domain experts..." (Shiffrin et al. 2004)

What Techniques are Used?

Hint: Lots!

Answer: Lots!

• Support Vector Machines
• Clustering
• Latent Dirichlet Allocation
• Latent Semantic Analysis
• Mixture Models
• Network modeling
• Network analysis
• Network visualization
• Bibliometric analysis

...and that's just what's mentioned in Shiffrin et al.!

In the readings...

• Markov Chain Monte Carlo – Griffiths et al.
• Latent Dirichlet Allocation – Blei et al. & Griffiths et al.
• Pathfinder [Networks|Scaling] – Chen
• Bayesian Networks – Blei et al. & Griffiths et al.
• Bibliometric networks – Garfield, Börner et al. & Chen

...and more, but there is an easy way to group them.

Two main categories

Statistical models/methods
• Markov Chain
• Monte Carlo method
• Dirichlet Distribution
• Bayesian Inference
• Latent Dirichlet Allocation

Network models/methods
• Degree distributions
• Centrality measures
• Scale free networks
• Small world networks
• Clustering coefficients
• Pathfinder scaling

Everyone got all that?

Stats Stuff

Just kidding...

Let's start with Markov Chains

Markov Chains

• A system that can exist in various states and that changes state in discrete steps.

• The changes are governed by transition probabilities, which exhibit the Markov property.

• The Markov property states that the state of the system at time n+1 depends on the state of the system at time n, but not on its state at any time < n. Hence the immediately previous state is the only factor that matters in determining the next state of the system.

• Example: A random walk on the number line with an equal probability of moving +1 or -1 at each step.

Largely from: http://en.wikipedia.org/wiki/Markov_chain
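A minimal sketch of the random-walk example, using only Python's standard library. The function name random_walk and the seed value are illustrative choices, not part of the slides. Each step looks only at the current position, which is the Markov property in action.

```python
import random

def random_walk(steps, start=0, seed=None):
    """Simulate a symmetric random walk on the integers."""
    rng = random.Random(seed)
    position = start
    path = [position]
    for _ in range(steps):
        # The next state depends only on the current position (Markov property).
        position += rng.choice((-1, 1))
        path.append(position)
    return path

path = random_walk(steps=10, seed=0)
print(path)
```

Running it with different seeds gives different paths, but every consecutive pair of positions differs by exactly 1.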

Monte Carlo method

• A process that uses repeated random sampling to derive an approximate result.

General Process:
– Define a domain of possible inputs.
– Generate inputs randomly from the domain using a specified probability distribution.
– Perform a deterministic computation using the inputs.
– Aggregate the results of the individual computations into the final result.

From: http://en.wikipedia.org/wiki/Monte_carlo_method
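The general process can be sketched with the classic exercise of estimating pi. This is an illustration chosen here, not an example from the slides, and the function name estimate_pi is hypothetical. The domain is the unit square, the distribution is uniform, the deterministic computation is a point-in-circle check, and the aggregate is a scaled proportion.

```python
import random

def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        # Draw one input uniformly from the domain (the unit square).
        x, y = rng.random(), rng.random()
        # Deterministic computation: is the point inside the quarter circle?
        if x * x + y * y <= 1.0:
            inside += 1
    # Aggregate: the quarter circle covers pi/4 of the square.
    return 4.0 * inside / num_samples

print(estimate_pi(100_000, seed=42))
```

The estimate tightens slowly as the number of samples grows, which is typical of Monte Carlo methods.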

"By our powers combined..."

Markov Chain Monte Carlo Method

• Gibbs sampling is one commonly used MCMC method (it is the sampler Griffiths and Steyvers use for LDA).

From:
Michael I. Jordan, Dirichlet Processes, Chinese Restaurant Processes and All That, 2005. http://www.cs.berkeley.edu/~jordan
David MacKay, Information theory, inference, and learning algorithms (Cambridge, UK; New York: Cambridge University Press, 2003).
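To make the combination concrete, here is a small Gibbs sampler sketch for a toy target: a standard bivariate normal with correlation rho. The target and the function name gibbs_bivariate_normal are assumptions made for illustration and are unrelated to the topic models in the readings. Each update draws one coordinate from its conditional distribution given the other (the Markov chain part), and the kept samples are averaged to estimate an expectation (the Monte Carlo part).

```python
import random

def gibbs_bivariate_normal(rho, num_samples, burn_in=500, seed=None):
    """Gibbs sampling from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho * rho) ** 0.5
    x, y = 0.0, 0.0
    samples = []
    for i in range(burn_in + num_samples):
        # x | y ~ Normal(rho * y, 1 - rho^2); y | x is symmetric.
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if i >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, num_samples=20_000, seed=0)
# Monte Carlo estimate of E[xy], which should land near rho.
mean_xy = sum(x * y for x, y in samples) / len(samples)
print(round(mean_xy, 2))
```

The burn-in discards early draws that still reflect the arbitrary starting point.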

Bayesian Paradigm

From: Percy Liang and Dan Klein, Structured Bayesian Nonparametric Models with Variational Inference, ACL Tutorial, Prague, Czech Republic. http://www.cs.berkeley.edu/~pliang/papers/tutorial-acl2007.pdf

Bayesian Inference

From: Robert Cowell, Introduction to inference for Bayesian networks, in Learning in graphical models, ed. Michael Jordan (MIT Press, 1999), 9-27.

Think of A as the topics in a document and B as the words observed.

The goal is to infer the most probable topic distribution given the observed words.
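The rule behind this reading is Bayes' theorem, stated here in its standard form with the same A and B (this is the textbook statement, not a reproduction of the original slide figure):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad\Longrightarrow\qquad
P(\text{topics} \mid \text{words}) \;\propto\; P(\text{words} \mid \text{topics})\, P(\text{topics})
```

Since P(B) does not depend on A, the most probable topics can be found by comparing the numerator alone.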

Latent Dirichlet Allocation

• A generative document model.
• Each document is composed of a number of words drawn from a number of topics that comprise the document.
• There is a probability distribution over topics for each document and a probability distribution over words for each topic.
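The generative story can be sketched in a few lines of standard-library Python. The vocabulary, the two toy topics, and the helper names sample_dirichlet and generate_document are all made up for illustration. This only generates documents; it does not perform the inference that recovers topics from text.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document under the LDA generative story."""
    # Per-document distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word slot, then a word from that topic.
        z = rng.choices(range(len(topic_word)), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["gene", "dna", "network", "graph"]   # toy vocabulary (assumed)
topic_word = [[0.5, 0.5, 0.0, 0.0],           # toy "genetics" topic
              [0.0, 0.0, 0.5, 0.5]]           # toy "networks" topic
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, doc)
```

Small alpha values push each document toward a single dominant topic, while larger values produce more even mixtures.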

...pictures help here...

Latent Dirichlet Allocation - Cont.

From: David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003): 993-1022.

Dirichlet Distribution

From: Percy Liang and Dan Klein, Structured Bayesian Nonparametric Models with Variational Inference, ACL Tutorial, Prague, Czech Republic. http://www.cs.berkeley.edu/~pliang/papers/tutorial-acl2007.pdf

Network Analysis/Methods

From: Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A., et al. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3), e4803. doi: 10.1371/journal.pone.0004803

Centrality

• Betweenness:
o Bridging
o # of shortest paths between other nodes that pass through a node.

• Closeness:
o Avg distance to other nodes.

• Degree:
o Number of edges of one type or another.
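The three measures can be computed on a toy graph with the standard library alone. The graph and function names below are illustrative assumptions. The sketch assumes an undirected, unweighted, connected graph, and betweenness uses Brandes' algorithm, which is one standard way to count shortest paths.

```python
from collections import deque

def degree(graph):
    """Number of edges at each node."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

def average_distance(graph):
    """Average shortest-path distance from each node to all others."""
    result = {}
    for s in graph:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[s] = sum(dist.values()) / (len(graph) - 1)
    return result

def betweenness(graph):
    """Shortest paths through each node (Brandes' algorithm)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in graph}
        # Accumulate dependencies from the farthest nodes back toward s.
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Toy tree: a - b - c, with d and e both attached to c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"], "d": ["c"], "e": ["c"]}
print(degree(graph))
print(average_distance(graph))
print(betweenness(graph))
```

In this toy tree, node c has the highest degree, the smallest average distance, and the highest betweenness, so all three measures agree. On larger networks they often rank nodes differently.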

Pathfinder Scaling

• Network scaling (edge reduction) method.
• Generates a minimum spanning tree plus a parameter-tunable number of redundant edges.
• Can use different metrics to determine which edges to prune, such as the Euclidean distance or edge weight.
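A sketch of one special case of Pathfinder scaling may help. It assumes edge weights are distances (smaller means closer) and uses the most aggressive pruning setting, usually written r = infinity and q = n - 1, where the length of a path is its largest edge weight. Those parameter names come from the general Pathfinder literature rather than from the slide, and the function name pathfinder_inf is hypothetical. An edge survives only if no indirect path has a smaller largest edge.

```python
import math

def pathfinder_inf(weights):
    """Prune a symmetric distance matrix with Pathfinder (r = inf, q = n - 1).

    weights[i][j] is the distance between i and j (math.inf if no edge).
    Returns the list of surviving edges as (i, j) pairs with i < j.
    """
    n = len(weights)
    # minimax[i][j]: smallest possible "largest edge weight" over all i-j paths.
    minimax = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i][k], minimax[k][j])
                if via_k < minimax[i][j]:
                    minimax[i][j] = via_k
    # Keep an edge only if no indirect path beats it.
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if weights[i][j] < math.inf and weights[i][j] <= minimax[i][j]]

w = [[0, 1, 4],
     [1, 0, 2],
     [4, 2, 0]]
print(pathfinder_inf(w))  # → [(0, 1), (1, 2)]
```

The edge between nodes 0 and 2 is dropped because the route through node 1 never uses an edge heavier than 2. Less aggressive parameter settings would retain more of the redundant edges.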

What's been done on the topic?

What did we learn today, class?

R. M. Shiffrin and K. Börner, Mapping knowledge domains, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5183-5185.

• Overview of the field and the PNAS articles.
• Important takeaway: There is a lot of research potential in this area, and it benefits from interdisciplinary analysis involving a variety of techniques.

T. L. Griffiths and Mark Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5228-5235.

• LDA
• Optimal Topics ~300
• LDA over PNAS abstracts 1991-2001
• ~3 million words total - ~20k terms in vocabulary

Contributions:
• Hot and Cold topics
• Topical clustering (heatmap)
• Demonstrate that content analysis can reveal topics

The word distribution within 10 topics.

30 random "documents" generated from the above 10 topics.

Using LDA to derive the original topics from the observed documents.

Griffiths et al. cont...

Cold Topics. Hot Topics!

David M. Blei and John D. Lafferty, A correlated topic model of Science, The Annals of Applied Statistics 1, no. 1 (2007): 17-35

• Correlated Topic Models
o Evolution of LDA
o Introduces the notion that the probabilities of the topics comprising a document are not necessarily independent.
o Replaces the Dirichlet distribution with a logistic normal distribution with a covariance structure as a parameter.
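In the correlated topic model, the topic proportions of a document come from a logistic normal distribution. A compact way to write that draw is given below. The symbols (eta for the Gaussian draw, mu and Sigma for its mean and covariance, theta for the topic proportions, K for the number of topics) are introduced here for illustration and do not appear on the slide.

```latex
\eta \sim \mathcal{N}(\mu, \Sigma), \qquad
\theta_k = \frac{\exp(\eta_k)}{\sum_{j=1}^{K} \exp(\eta_j)}, \quad k = 1, \dots, K
```

Off-diagonal entries of Sigma let two topics rise and fall together, which a single Dirichlet draw cannot express.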

Blei et al. cont...

Contributions:
• CTM outperforms LDA when the number of topics is larger.
• CTM also predicts words more accurately with less training data than LDA.
• Both of these are credited to the effect of topic correlation on the distribution of topics in a document.
• A science map!
o The covariance matrix is used to create a graph where the topics are vertices and the edges represent some level of covariance.
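As a rough illustration of turning a covariance matrix into a topic graph, the sketch below simply keeps an edge wherever the absolute covariance clears a threshold. Plain thresholding, the toy matrix, the topic labels, and the function name covariance_graph are all assumptions made here. The procedure Blei and Lafferty actually use to build their graph may differ.

```python
def covariance_graph(cov, labels, threshold):
    """Return (topic, topic, covariance) edges with |covariance| >= threshold."""
    edges = []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(cov[i][j]) >= threshold:
                edges.append((labels[i], labels[j], cov[i][j]))
    return edges

labels = ["genetics", "neuroscience", "astronomy"]   # toy topic labels
cov = [[1.0, 0.6, 0.05],
       [0.6, 1.0, -0.1],
       [0.05, -0.1, 1.0]]                            # toy covariance matrix
print(covariance_graph(cov, labels, threshold=0.5))
```

With this toy input only the genetics-neuroscience pair clears the threshold, so the graph has a single edge.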

Blei et al.

Katy Börner, Jeegar T. Maru, and Robert L. Goldstone, The simultaneous evolution of author and paper networks, Proceedings of the National Academy of Sciences 101, no. suppl_1 (1, 2004): 5266-5273.

• Bibliometric network encompassing coauthorship and citation.

• Built to model the PNAS collaboration/citation network.

• 2 node types:
o Authors
o Papers

• Several edge types:
o directional information flow between paper:author and paper:paper
o author and coauthor

Börner cont...

3 main parameters in the model:
1. Topics - i.e. scientific specializations
2. Aging - Meant to capture the bias to cite recent material.
3. Recursive Linking - The propensity to read the papers cited by the papers you have read.

Iterative simulation that modeled the addition of new authors, the removal of old ones, coauthorship, and the propensity of authors to publish within their topic domain.

Looks like a decent fit...

Börner cont...

Contributions:
• Models the constraint of aging on preferential attachment in scale free network formation.
• Models the "splintering" of science caused by specialization.

Chaomei Chen, CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature, Journal of the American Society for Information Science and Technology 57, no. 3 (2006): 359-377.

• Co-citation network.
• Clusters labeled based on Kleinberg's burst detection algorithm (Kleinberg 2002).

9 steps:
– Identify knowledge domain, e.g. "mass extinction".
– Automated data collection. Uses PubMed and Web of Science.
– Find burst terms. CiteSpace II scrapes [1-4]-grams.
– Time slicing. Generate time series views of the network.
– Choose thresholds (intellectual bases & research fronts).
– Graph scaling. Reduce edges to improve visual clarity without sacrificing critical visual data.
– Layout. Typically a force directed layout to emphasize clustering.
– Visual inspection. Tweak labels and display of metadata.
– Verify pivot points.
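The term-scraping step can be pictured with a small n-gram extractor. This is a generic sketch with whitespace tokenization, and the function name extract_ngrams is hypothetical. It is not CiteSpace II's actual implementation, and it stops short of burst detection, which would then look for terms whose frequency jumps over time.

```python
def extract_ngrams(text, max_n=4):
    """Return all word n-grams of length 1 through max_n from the text."""
    tokens = text.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

print(extract_ngrams("Mass extinction event"))
# → ['mass', 'extinction', 'event', 'mass extinction', 'extinction event', 'mass extinction event']
```

Counting these n-grams per time slice gives the frequency series that a burst detector would take as input.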

Chen cont...

Contributions:
• Detecting research fronts, intellectual bases, and pivots.

• Detecting trends in scientific research.

• Visualization of knowledge domain.

Tree ring view of citations over time

Overview of Citespace II

Eugene Garfield, Historiographic Mapping of Knowledge Domains Literature, Journal of Information Science 30, no. 2 (April 1, 2004): 119-145.

• Co-founder of scientometrics.
• Bibliometric analysis and link tracking reveal impact of papers on a field.
• Concept of local citation score and global citation score.
• Adding group and time slicing to learn more about the effect of those slices on bibliometric records.

Garfield cont...

Contributions:
• Bibliometrics...
• Finding out which papers were important at a certain time vs. from a current perspective.

Thank You

Questions or Comments?

Bonus Bibliography Slide!

• Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17-35. doi: 10.1214/07-AOAS114

• Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993-1022.

• Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A., et al. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3), e4803. doi: 10.1371/journal.pone.0004803

• Börner, K., Maru, J. T., & Goldstone, R. L. (2004). The simultaneous evolution of author and paper networks. Proceedings of the National Academy of Sciences, 101(suppl_1), 5266-5273. doi: 10.1073/pnas.0307625100

• Börner, K. (2007). Making sense of mankind’s scholarly knowledge and expertise: collecting, interlinking, and organizing what we know and different approaches to mapping (network) science. Environment and Planning B: Planning and Design, 34(5), 808-825. doi: 10.1068/b3302t

• Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359-377. doi: 10.1002/asi.20317

• Cowell, R. (1999). Introduction to inference for Bayesian networks. In M. Jordan (Ed.), Learning in graphical models (pp. 9-27). MIT Press.

• Garfield, E. (2004). Historiographic Mapping of Knowledge Domains Literature. Journal of Information Science, 30(2), 119-145. doi: 10.1177/0165551504042802

• Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1), 5228-5235. doi: 10.1073/pnas.0307752101

• Hall, D., Jurafsky, D., & Manning, C. D. (2008). Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 363-371). Honolulu, Hawaii: Association for Computational Linguistics. Retrieved from http://portal.acm.org.proxy.lib.umich.edu/citation.cfm?id=1613715.1613763

• Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569-16572. doi: 10.1073/pnas.0507655102

• Janssens, F., Glänzel, W., & Moor, B. D. (2007). Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 360-369). San Jose, California, USA: ACM. Retrieved from http://portal.acm.org.proxy.lib.umich.edu/citation.cfm?id=1281192.1281233

Bibliography continued...

• Jordan, M. I. (1999). Learning in graphical models. MIT Press.

• Leicht, E. A., Clarkson, G., Shedden, K., & Newman, M. E. J. (2007). Large-scale structure of time evolving citation networks. arXiv:0706.0015. doi: 10.1140/epjb/e2007-00271-7

• MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge, UK; New York: Cambridge University Press.

• Shibata, N., Kajikawa, Y., Takeda, Y., & Matsushima, K. (2009). Comparative study on methods of detecting research fronts using different types of citation. J. Am. Soc. Inf. Sci. Technol., 60(3), 571-580.

• Shiffrin, R. M., & Börner, K. (2004). Mapping knowledge domains. Proceedings of the National Academy of Sciences, 101(suppl_1), 5183-5185. doi: 10.1073/pnas.0307852100

• Torres-Moreno, J., St-Onge, P., Gagnon, M., El-Bèze, M., & Bellot, P. (2009, May 1). Automatic Summarization System coupled with a Question-Answering System (QAAS). ArXiv e-prints. Retrieved January 11, 2010, from http://adsabs.harvard.edu/abs/2009arXiv0905.2990T

• Zhu, D., & Porter, A. L. (2002). Automated extraction and visualization of information for technological intelligence and forecasting. Technological Forecasting and Social Change, 69, 495-506. doi: 10.1016/S0040-1625(01)00157-3

