AIRFrame: Astrobiology Integrative Research Framework
Lisa Miller and Rich Gazan
Department of Information and Computer Sciences
UH-NASA Astrobiology [email protected], [email protected]
August 5, 2010
Overview
● AIRFrame project rationale
● AIRFrame project goals
● Textpresso system
– Ontology
– Database
– Adaptation to astrobiology
● Current and Future work
– Database and Ontology building
– Clustering/Classification
– Browsable visual interface
Nature of Astrobiology
● System-level science
– Concerned with complex, multidisciplinary, multi-phenomena behaviors of large physical and biological systems
– Information technology needed to consolidate and represent knowledge and data across many disciplines
● New field
– No centralized repositories of knowledge
– No established, standardized vocabulary
Interdisciplinary Collaboration
● Proven to be difficult
– Different disciplines have different:
  ● Vocabularies/terminologies
  ● Methods and formats for sharing and presenting research
  ● Assumed levels of precision
– Institutional and cultural boundaries
● Concern about duplication of research
– Lack of access to existing knowledge
– Lack of discipline-specific knowledge to efficiently access what is available
AIRFrame project goals
● Discover and relate diverse research concepts as a high level activity
● Eliminate the need to search for data using combinations of specific keywords
● Show users conceptual and functional relationships between diverse research documents
Keyword Search Inadequate without a well-defined vocabulary
● Keyword searches on Elsevier's ScienceDirect using some astrobiologically relevant synonyms:

Keywords                Number of results   An article found only with this search
“prebiotic synthesis”   566                 The origin of life – a review of facts and speculations
“prebiotic chemistry”   611                 A possible circular RNA at the origin of life
“abiotic synthesis”     312                 Recent advances in the chemical evolution and the origins of life
“organic synthesis”     53,434              Prebiotic organic synthesis in early Earth and Mars atmospheres
Keyword Search inadequate for interdisciplinary work
Two searches on Thomson Reuters' ISI Web of Science:
1. astrobio*
- 791 results
- astron*: 23,000+ results
2. amino acid* Earth
- 940 results
● Only 28 occur in both
● No results from Journal of Theoretical Biology or Origins of Life and Evolution of Biospheres in the astrobio* search.
● Some articles with the author-assigned keyword “astrobiology” appear in search 2 but are not returned by search 1
[Figures: Search: astrobio*; Search: amino acid* Earth]
Textpresso-based system
● Open-source information retrieval and extraction system
– Developed at Caltech for biological research
– Currently deployed in 17 different, tightly focused literatures
● Two major components
– Database of full-text scholarly documents
– Ontology cataloging types of objects, abstract concepts, and relationships
Ontology
● Textpresso uses a shallow ontology to catalog terms
[Figure: three kinds of categories (Concepts, Descriptions, Relationships), with example categories and member terms]
Purpose: Fulfill, Make, Govern, Produce, ...
NucleicAcid: Adenine, Cytosine, DNA, Guanine, ...
Comparison: Dissimilar, Equal sized, Like, Related, ...
Database of full-text articles
[Figure: All Documents → Single Document (Abstract, Body Text, Authors, Title) → Individual sentences → XML Markup, tagged using the Ontology]
Individual sentences
We report the synthesis of glycine on interstellar ice-analog films composed of water, methylamine (MA), and carbon dioxide ...
XML Markup
We report the <biologicalprocess>synthesis</biologicalprocess> of <aminoacid>glycine</aminoacid> on interstellar ice-analog films <action>composed</action> of <volatile><water>water</water></volatile>, methylamine ( MA ), and <volatile><gases><chemicalelement>carbon</chemicalelement> dioxide</gases></volatile>
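The markup step shown on this slide can be sketched roughly as follows. This is a minimal illustration, not Textpresso's actual implementation: the tag names are taken from the slide's example, while the `ONTOLOGY` table and the `mark_up` function are hypothetical names introduced here.

```python
import re

# Hypothetical mini-ontology: term -> category tags, outermost tag first.
# Tag names come from the slide's example; the table itself is an assumption.
ONTOLOGY = {
    "synthesis": ["biologicalprocess"],
    "glycine": ["aminoacid"],
    "composed": ["action"],
    "water": ["volatile", "water"],
}

def mark_up(sentence, ontology):
    """Wrap every known term in its (possibly nested) category tags."""
    # Longest terms first, so longer terms win over shorter overlapping ones.
    terms = sorted(ontology, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        text = match.group(0)
        tags = ontology[text.lower()]
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        return opening + text + closing

    return pattern.sub(wrap, sentence)

marked = mark_up("We report the synthesis of glycine", ONTOLOGY)
```

Each term carries a list of tags so that nested categories, such as `<volatile><water>`, close in the reverse order they open and the output stays well-formed.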
System query
Allows search by meaning rather than specific keyword
Search by:
● keyword,
● keyword + synonyms,
● and/or whole categories
System Query - Results
Results ranked by number of sentences with matching terms
[Figure: result sentences with matches labeled Keyword, Synonym, Category 1, Category 2]
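A rough sketch of how a query by keyword, synonyms, and categories could be ranked by the number of matching sentences. The `CATEGORIES` and `SYNONYMS` tables, the `category:` prefix, and the rule that every query item must match within the same sentence are all assumptions made for illustration; they are not Textpresso's actual query syntax.

```python
import re

# Hypothetical shallow ontology (category -> terms) and synonym table.
CATEGORIES = {"aminoacid": {"glycine", "alanine"}, "volatile": {"water", "methane"}}
SYNONYMS = {"prebiotic": {"prebiotic", "abiotic"}}

def expand(query_item):
    """Return the set of terms that one query item stands for."""
    if query_item.startswith("category:"):
        return set(CATEGORIES.get(query_item[len("category:"):], ()))
    return set(SYNONYMS.get(query_item, {query_item}))

def sentence_matches(sentence, term_sets):
    """True if the sentence contains a term from every query item."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return all(words & terms for terms in term_sets)

def rank(documents, query_items):
    """Rank documents by their number of matching sentences, highest first."""
    term_sets = [expand(q) for q in query_items]
    scores = {
        doc_id: sum(sentence_matches(s, term_sets) for s in sentences)
        for doc_id, sentences in documents.items()
    }
    hits = [(doc_id, n) for doc_id, n in scores.items() if n > 0]
    return sorted(hits, key=lambda item: (-item[1], item[0]))

# Made-up documents, each already split into sentences.
DOCS = {
    "doc_a": ["Abiotic synthesis of glycine was observed.", "Water ice was present."],
    "doc_b": ["Prebiotic chemistry may yield alanine.",
              "Prebiotic routes to glycine exist.",
              "No match here."],
    "doc_c": ["Methane was detected."],
}
results = rank(DOCS, ["prebiotic", "category:aminoacid"])
```

In this sketch `doc_a` is found through the synonym "abiotic" even though the query says "prebiotic", which is the behavior the slide describes as search by meaning.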
Adaptation challenges
● Previous implementations of Textpresso have been based on pre-existing ontologies such as the Gene Ontology
● Previous implementations have been much more narrowly focused, such as genetics for a single organism
● Most biological journals have open full-text access through a single source, PubMed
Current Work: Ontology Standards Adoption
● To allow more breadth to the ontology
– Build a new ontology using the SKOS standard, as used by the Vocabulary Explorer and IVOA vocabularies
– Adapt Textpresso to directly read SKOS ontology
● Standards-based ontology can
– Ease porting existing ontologies to ours
– Allow easy sharing of our data with other systems
Why SKOS?
● Allows fast addition of terms from existing sources such as the International Astronomical Union Thesaurus and IVOA.
● http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/
● Created to allow conversion from other formats
● SKOS allows the kind of relationships we want to leverage with AIRFrame
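To make the kinds of SKOS relationships concrete, here is a small sketch that writes one concept as Turtle text. The SKOS properties used (`skos:prefLabel`, `skos:altLabel`, `skos:broader`, `skos:related`) are part of the W3C vocabulary; the `ex:` namespace, the concept names, and the `concept_to_turtle` helper are hypothetical and are not part of AIRFrame.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
EX_PREFIX = "@prefix ex: <http://example.org/airframe#> ."  # hypothetical namespace

def concept_to_turtle(name, pref_label, alt_labels=(), broader=(), related=()):
    """Serialize one SKOS concept as a Turtle statement block."""
    statements = ["a skos:Concept", f'skos:prefLabel "{pref_label}"@en']
    # altLabel holds synonyms; broader/related link to other concepts.
    statements += [f'skos:altLabel "{label}"@en' for label in alt_labels]
    statements += [f"skos:broader ex:{parent}" for parent in broader]
    statements += [f"skos:related ex:{other}" for other in related]
    return f"ex:{name} " + " ;\n    ".join(statements) + " ."

turtle = concept_to_turtle(
    "glycine",
    "glycine",
    alt_labels=["aminoacetic acid"],
    broader=["aminoAcid"],
)
document = "\n".join([SKOS_PREFIX, EX_PREFIX, "", turtle])
```

Synonyms map onto `skos:altLabel` and category membership onto `skos:broader`, which is roughly the information a shallow term catalog needs.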
Current Work: Database Building
● “Proof-of-concept” AIRFrame/Textpresso is currently up and functioning
– www.ifa.hawaii.edu/airframe/textpresso
● Workflow
[Workflow diagram boxes: New articles; AIRFrame ontology; Mark up documents & index; Mine for new terms; Outside ontologies; AIRFrame searchable database]
Current Work: Initial Steps in Document Classification
● Want to use machine learning methods to discover connections in the data
– Possibly use several to give users many views.
● Some work done on phrase-based clustering
● My research area is in information theoretic clustering
● Open to other ideas
Information Bottleneck Method
● Developed by Tishby, Pereira, Bialek in 1999
● Based on Shannon Information
– Measures the reduction in uncertainty or the distortion between an original signal and its compressed representation
– Where uncertainty is the entropy

\[ I[x, y] = H[x] - H[x \mid y] \]
\[ H[x] = \sum_{x}^{N} p(x) \log \frac{1}{p(x)} \]
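As a quick numerical check of the entropy H[x] and the mutual information I[x, y] = H[x] - H[x|y] defined on this slide, the sketch below computes both from a small joint distribution. The slide does not fix the log base; base 2 (bits) is an assumption here, and the function names are introduced for illustration.

```python
import math

def entropy(p):
    """H = sum_x p(x) log2(1 / p(x)), skipping zero-probability outcomes."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

def conditional_entropy(joint):
    """H[x|y] for a table joint[x][y] = p(x, y)."""
    h = 0.0
    for y in range(len(joint[0])):
        p_y = sum(row[y] for row in joint)
        if p_y == 0:
            continue
        # Entropy of p(x | y), weighted by p(y).
        h += p_y * entropy([row[y] / p_y for row in joint])
    return h

def mutual_information(joint):
    """I[x, y] = H[x] - H[x|y]."""
    p_x = [sum(row) for row in joint]
    return entropy(p_x) - conditional_entropy(joint)

independent = [[0.25, 0.25], [0.25, 0.25]]  # y tells us nothing about x
correlated = [[0.5, 0.0], [0.0, 0.5]]       # y determines x exactly

mi_independent = mutual_information(independent)
mi_correlated = mutual_information(correlated)
```

The two toy tables bracket the possibilities: knowing y removes no uncertainty about x in the first case and all of it (one bit) in the second.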
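IB Method 2
● Data is compressed so that information about a quantity of interest (in this case words) is kept maximally.

\[ \max_{p(c \mid x)} \left[ I(y, c) - T \, I(x, c) \right] \]

● Relative entropy emerges as the distortion function (Kullback-Leibler divergence)

\[ D_{KL}\left[ p(y \mid x) \,\|\, p(y \mid c) \right] = \sum_{y} p(y \mid x) \log_2 \left[ \frac{p(y \mid x)}{p(y \mid c)} \right] \]

● With the optimal assignment rule

\[ p(c \mid x) = \frac{p(c)}{Z(x, T)} \exp\left( -\frac{1}{T} D_{KL}\left[ p(y \mid x) \,\|\, p(y \mid c) \right] \right) \]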
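The assignment rule on this slide can be tried out on toy numbers as below. This is a sketch of a single assignment step for one document x, assuming the cluster distributions p(y|c) and cluster weights p(c) are already given; the full method would re-estimate these iteratively, which is omitted here. All names and numbers are illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL[p || q] with log base 2; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def assign(p_y_given_x, p_c, p_y_given_c, temperature):
    """p(c|x) proportional to p(c) * exp(-D_KL[p(y|x) || p(y|c)] / T)."""
    weights = [
        pc * math.exp(-kl_divergence(p_y_given_x, p_y_c) / temperature)
        for pc, p_y_c in zip(p_c, p_y_given_c)
    ]
    z = sum(weights)  # normalization constant Z for this x and T
    return [w / z for w in weights]

# Toy example: one document's word distribution and two candidate clusters.
p_y_given_x = [0.9, 0.1]
p_c = [0.5, 0.5]
p_y_given_c = [[0.8, 0.2], [0.2, 0.8]]

soft = assign(p_y_given_x, p_c, p_y_given_c, temperature=1.0)
hard = assign(p_y_given_x, p_c, p_y_given_c, temperature=0.01)
```

Lowering T sharpens the assignment toward the cluster whose word distribution is closest in KL divergence, while a very large T pushes p(c|x) back toward the prior p(c).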
IB Method 3
● Has been shown to be one of the most accurate unsupervised clustering algorithms
● Downside: slow & computationally expensive
● I am working to establish the optimal number of clusters automatically within the same process using annealing.
Future Plans
● Visualizations!
Future plans
● Create a cleaner, browsable interface which displays results and links in an easily understandable way
● Current Interface
– Ugly!
– Not intuitive
– Results spread across pages: hard to see overall picture
Future Plans
● Allow users to input a document and get back information such as
– Related articles
– Connections to other researchers
– Relevant NASA/NAI goals
Some Ideas
● Use document classification and clustering assisted by the semantic markup
● Create clusterings unsupervised and then use a Support Vector Machine to classify a document input by the user
● Display connections with topic maps by making Textpresso output in ISO Topic Map form
Personas
● We envision AIRFrame being able to present information in dynamic ways such as this:
● http://personas.media.mit.edu/personasWeb.html
● Would an interface like this be valuable?
– Notice: cannot click through to sites – will come back to this
LDA
● Used by Personas
● Returns document as a distribution across topics
– Topics are clusters of single word terms
– Manually labeled for Personas
● Personas uses an SVM trained with the LDA groupings for on-the-fly classification
Snappy Words
● WordNet visualizer
http://www.snappywords.com/?lookup=nasa
Open Questions
● How best to use our semantic markup to present our data?
● How to visualize our information to allow discovery and understanding of connections?
● How best to search for and represent
– Category links between terms?
– Authorship linkages between projects?
– NASA/NAI goal links to research?
● Ontology building
– Can it be at least partially automated?
Issues
● Speed
– Users need to have feedback if they need to wait.
– Slow unsupervised methods can be run offline during database building
● Users need to be able to do novel searches, not just a few predefined ones.
● Complexity and size of ontology
Thank you!
For updates visit our site at: www.ifa.hawaii.edu/airframe
To test out our Textpresso version: www.ifa.hawaii.edu/airframe/textpresso
References and Resources
● R.M. Keller, “A Survey of Knowledge Management Research & Development at NASA Ames Research Center,” 2002.
● I. Yoo, X. Hu, and I. Song, “Biomedical ontology improves biomedical literature clustering performance: a comparison study,” Int. J. Bioinformatics Res. Appl., vol. 3, 2007, pp. 414-428.
● I. Becerra-Fernandez, H. Stewart, and C. Knight, “Developing Distributed Collaboration Systems at NASA: A Report from the Field,” Florida Workshop on Distributed Collaboration, Miami, FL, United States, 2001.
● IVOA Recommendation - An IVOA Standard for Unified Content Descriptors. http://www.ivoa.net/Documents/latest/UCD.html
● K.W. Boyack, K. Mane, and K. Börner, “Mapping Medline Papers, Genes, and Proteins Related to Melanoma Research,” Information Visualisation, International Conference on, Los Alamitos, CA, USA: IEEE Computer Society, 2004, pp. 965-971.
● The Web and SKOS, http://www.iskouk.org/presentations/miles_web_and_skos_200807.pdf
● “Omnigator Vizigator.” http://www.ontopia.net/omnigator/plugins/viz/viz.jsp?tm=opera.ltm&id=puccini&redirect=%2Fomnigator%2Fmodels%2Ftopic_complete.jsp%3Ftm%3Dopera.ltm%26id%3Dpuccini
● “Ontologies Come of Age.” http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-%28with-citation%29.htm
● M. Uschold and M. Gruninger, “Ontologies: Principles, methods and applications,” To appear in Knowledge Engineering Review, vol. 11, 1996.
● N.F. Noy and D.L. McGuinness, Ontology development 101: A guide to creating your first ontology, Citeseer, 2001.
● PANDORA: Programs for AstroNomical Data Organization Reduction and Analysis. http://cosmos.iasf-milano.inaf.it/pandora/
● Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0020309
● Alasdair J.G. Gray, Norman Gray, Christopher W. Hall, Iadh Ounis, “Finding the right term: Retrieving and exploring semantic concepts in astronomical vocabularies,” Information Processing & Management, Volume 46, Issue 4, Semantic Annotations in Information Retrieval, July 2010, Pages 470-478, ISSN 0306-4573, DOI: 10.1016/j.ipm.2009.09.004.
● SKOS Simple Knowledge Organization System Primer. http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
● SKOS Simple Knowledge Organization System Reference. http://www.w3.org/TR/skos-reference/
● Richard M. Keller, Daniel C. Berrios, Robert E. Carvalho, David R. Hall, Stephen J. Rich, Ian B. Sturken, Keith J. Swanson, Shawn R. Wolfe, “SemanticOrganizer: A Customizable Semantic Repository for Distributed NASA Project Team,” 2004, pp. 767–781. http://www.springerlink.com/content/QJ6TNF41MA7EC52U
● “The Astronomy & Astrophysics Keyword List.” http://www.ivoa.net/rdf/Vocabularies/vocabularies-20091007/AAkeys/AAkeys.html
● “The National Virtual Observatory: Tools and Techniques for Astronomical Research,” aspbooks.org.
● A.P. Martinez, S. Derriere, N. Delmotte, N. Gray, R. Mann, J. McDowell, T. McGlynn, F. Ochsenbein, and others, “The UCD1+ controlled vocabulary Version 1.23.”
● S.K. Card, “Visualizing Retrieved Information: A Survey,” IEEE Computer Graphics and Applications, vol. 16, 1996, pp. 63-67.
● A.J. Gray, A.P. Martinez, and I. INAF, “Vocabularies in the Virtual Observatory Version 1.19.” http://www.ivoa.net/Documents/REC/Semantics/Vocabularies-20091007.pdf
● “What is an ontology and why we need it” http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html
● Canas, A. J., and Carvalho, M.: Concept Maps and AI: an Unlikely Marriage?, Anais do Simpósio Brasileiro de Informática na Educação, volume 1, 1, 2004
● Canas, A. J., Hill, G., Carff, R., Suri, N., Lott, J., Eskridge, T., Gómez, G., Arroyo, M., and Carvajal, R.: CmapTools: A knowledge modeling and sharing environment, Concept maps: Theory, methodology, technology. Proceedings of the first international conference on concept mapping, volume 1, 125–133, 2004
● Blei, David M., and Lafferty, John D.: A correlated topic model of Science, The Annals of Applied Statistics 1(1), 17–35, 2007
● Blei, David M., Ng, Andrew Y., and Jordan, Michael I.: Latent Dirichlet allocation, J. Mach. Learn. Res. 3, 993–1022, 2003
● Han, Jiawei, and Kamber, Micheline: Data mining: concepts and techniques, Morgan Kaufmann, 772, 2006
● S. Still and W. Bialek. How many clusters? An information theoretic perspective. Neural Computation, 16(12):2483-2506, 2004.
● S. Still. Information theoretic approach to interactive learning. EPL 85 (2009) 28005.