Visualization of Relational Text Information for Biomedical Knowledge Discovery James W. Cooper IBM...

Post on 19-Jan-2016

Visualization of Relational Text Information

for Biomedical Knowledge Discovery

James W. Cooper

IBM T J Watson Research Center

Hawthorne, NY

Overview

Prior work
Java based text mining
Computation of unnamed relations
Graphical display of relations

Relations between terms

Noun phrase co-occurrence statistics [Roark, Charniak]
Choose seed words and look for terms near them. [Brin] [Gravano, Agichtein]
– Repeat
Biomedical domain
– Blaschke used dictionary of common verbs
– Pustejovsky found inhibit relations
Stevens, Palakal, Mostafa
– Detected abstract-wide co-occurrence using dictionary of genes and useful verbs.

Graphical Displays

BioLayout – protein similarity
ProtInAct – interactive system using yFiles
Zhang – interactive 3D system
Jenssen – gene network
Leroy – GeneScene

BioLayout – Enright and Ouzounis

Spheres represent proteins and lines represent protein similarities.

Five related protein families and their corresponding relationships.

ProInAct – Spencer and Bennett

Proteins clustered by functional interaction

Zhang – Protein interaction mapping

Jenssen – A literature network

Lines connect genes that have co-occurred in 1 or more papers.

Leroy – GeneScene

What would we like to do?

Find scientifically meaningful connections between important terms.
– Such as Swanson’s Raynaud’s disease – fish oil connection.
Allow exploration of relations by user.
Filter the relations by ontology or term types
Perform path analysis
Let the user vary the graphical display.
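
Path analysis over stored relations could look roughly like the sketch below. It is an illustration only, not the system described in these slides: the function name `shortest_path`, the breadth-first search strategy, and the tiny edge list are all assumptions chosen to mirror the Raynaud's disease and fish oil example.

```python
from collections import deque

def shortest_path(relations, start, goal):
    """Breadth-first search over an undirected graph of term pairs.

    relations: iterable of (term_a, term_b) pairs (hypothetical input format).
    Returns the shortest list of terms linking start to goal, or None.
    """
    graph = {}
    for a, b in relations:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Illustrative edges only; not data from the patent collections.
edges = [
    ("raynaud's disease", "blood viscosity"),
    ("blood viscosity", "fish oil"),
    ("fish oil", "omega-3"),
]
path = shortest_path(edges, "raynaud's disease", "fish oil")
```

A path with one intermediate term is the kind of second-order connection that a user would not find by looking at direct relations alone.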

Data we analyzed

Two sets of patent data
– 584 patents on Viagra and phosphodiesterase inhibitors.
– 1514 patents on quinolones (like Cipro)

Recognized major technical terms in each patent.

Filtered organic chemical nomenclature.

The Talent text mining system

Text Analysis and Language Engineering Tools
– Finds multiword noun phrases
– Does shallow parse
– Can extract noun phrases (NPs) and verb groups (VGs), as well as all other sentence parts

The JTalent Library

Java class library with JNI interface
– To Talent DLL
Creates database load files of terms
– Paragraph
– Sentence
– Offset
– Term type (NP, VG)

TalentShow Demo

The KSS Library

Java class library of functions for
– Accessing a database (DB2, Access)
– Manipulating a search engine
– Manipulating tables of information created by JTalent.

Database Tables

Documents
– Title, author, URL, ID
TermDocs
– Term
– Paragraph
– Sentence
– Offset
– Type
Dictionary of terms, types and IDs
– Such as MeSH

Computing term information

Compute unique terms from TermDocs
Compute frequency
Compute salience
– Based on frequency
– Number of docs they appear in more than once
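
The term statistics on this slide can be sketched in a few lines. This is a minimal illustration, not the KSS implementation: the `(doc_id, term)` row layout, the lower-casing of terms, and the function name `term_stats` are assumptions. Because the slide says only that salience depends on frequency and on the number of documents in which a term appears more than once, the sketch returns those two inputs and does not guess how they are combined.

```python
from collections import Counter, defaultdict

def term_stats(termdocs):
    """termdocs: iterable of (doc_id, term) rows, one row per term occurrence.

    Returns {term: (frequency, repeat_docs)}, where repeat_docs is the number
    of documents in which the term occurs more than once.
    """
    per_doc = defaultdict(Counter)
    for doc_id, term in termdocs:
        # Lower-casing merges case variants into one unique term (an assumption).
        per_doc[term.lower()][doc_id] += 1
    stats = {}
    for term, counts in per_doc.items():
        frequency = sum(counts.values())
        repeat_docs = sum(1 for n in counts.values() if n > 1)
        stats[term] = (frequency, repeat_docs)
    return stats

# Illustrative rows only.
rows = [(1, "quinolone"), (1, "Quinolone"), (2, "quinolone"), (2, "DNA gyrase")]
stats = term_stats(rows)
```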

Compute term relations

Named relations based on abbreviation expansions.

Unnamed relations based on proximity, with weight based on how frequently they occur near each other.

Mutual information weight:

$$ m = \log \frac{\mathit{totalterms} \cdot \mathit{paircount}}{\mathit{freq}_1 \cdot \mathit{freq}_2} $$
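
As a rough sketch of how such a weight might be computed, the code below reads the mutual information weight as log(totalterms × paircount / (freq1 × freq2)). Several things here are assumptions rather than details from the slides: "near each other" is taken to mean "in the same sentence", the natural logarithm is used, and the function name `relation_weights` and the sample sentences are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def relation_weights(sentences):
    """sentences: list of term lists; terms sharing a sentence count as 'near'.

    Returns {(term_a, term_b): weight} with term_a < term_b alphabetically.
    """
    freq = Counter()        # occurrences of each term
    pair_count = Counter()  # number of sentences containing both terms
    for terms in sentences:
        freq.update(terms)
        for a, b in combinations(sorted(set(terms)), 2):
            pair_count[(a, b)] += 1
    total_terms = sum(freq.values())
    return {
        (a, b): math.log(total_terms * n / (freq[a] * freq[b]))
        for (a, b), n in pair_count.items()
    }

# Illustrative input only.
sentences = [
    ["sildenafil", "phosphodiesterase"],
    ["sildenafil", "phosphodiesterase"],
    ["sildenafil", "tablet"],
    ["tablet", "coating"],
]
weights = relation_weights(sentences)
```

Pairs that co-occur more often than their individual frequencies would predict get larger weights, which is why a rare pair such as tablet and coating can outscore a frequent one in a sample this small.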

Tuning Computed relations

Select only terms above a salience threshold.

Keep only relations in which one or both terms are members of an ontology.

Store relations in a database table for rapid access:

Term | weight | term
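
The filtering and storage steps might be sketched as follows. This is an assumption-laden illustration: SQLite stands in for the DB2 or Access database mentioned earlier, and the table name `relations`, the function name `store_relations`, and the sample values are hypothetical.

```python
import sqlite3

def store_relations(conn, weights, salience, ontology, threshold):
    """Filter relations, then store them as (term1, weight, term2) rows.

    A relation is kept when both terms are above the salience threshold and
    at least one of them is a member of the ontology.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relations (term1 TEXT, weight REAL, term2 TEXT)"
    )
    rows = [
        (a, w, b)
        for (a, b), w in weights.items()
        if salience.get(a, 0) > threshold
        and salience.get(b, 0) > threshold
        and (a in ontology or b in ontology)
    ]
    conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

# Illustrative values only.
conn = sqlite3.connect(":memory:")
sample_weights = {("ciprofloxacin", "dna gyrase"): 1.4, ("tablet", "coating"): 1.1}
sample_salience = {"ciprofloxacin": 80, "dna gyrase": 75, "tablet": 20, "coating": 15}
stored = store_relations(conn, sample_weights, sample_salience, {"ciprofloxacin"}, threshold=50)
found = conn.execute(
    "SELECT term2, weight FROM relations WHERE term1 = ?", ("ciprofloxacin",)
).fetchall()
```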

Original System

Visual client
SOAP server
– Queries database to get relations
– Round trip for each new query

Instead, we export the data for the user to visualize as they wish.

Exporting relations

Save relations and ontology information in an XML file.

<relation>
  <term> <iq>78</iq> <source>MeSH</source>
    <relationDocuments>
      <doc>34</doc>
    </relationDocuments>
  </term>
  <term> </term>
</relation>

This XML file is a portable version of the computed relations that we can then use with any number of viewers.
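
An export along these lines could be written with a standard XML library. The sketch below is not the exporter used in this work: the element names `relation`, `term`, `iq`, `source`, `relationDocuments` and `doc` follow the slide, while the `name` element holding the term text, the function name `relation_to_xml`, and the sample values are assumptions.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(term1, term2, doc_ids):
    """Build one <relation> element from two term dicts and a list of doc IDs.

    Each term dict has 'name', 'iq' and 'source' keys. The <name> element is
    hypothetical; the slide does not show where the term text is stored.
    """
    relation = ET.Element("relation")
    for index, term in enumerate((term1, term2)):
        node = ET.SubElement(relation, "term")
        ET.SubElement(node, "name").text = term["name"]
        ET.SubElement(node, "iq").text = str(term["iq"])
        ET.SubElement(node, "source").text = term["source"]
        if index == 0:
            # As on the slide, the document list hangs off the first term.
            docs = ET.SubElement(node, "relationDocuments")
            for doc_id in doc_ids:
                ET.SubElement(docs, "doc").text = str(doc_id)
    return relation

# Illustrative values only.
rel = relation_to_xml(
    {"name": "ciprofloxacin", "iq": 78, "source": "MeSH"},
    {"name": "dna gyrase", "iq": 70, "source": "MeSH"},
    [34],
)
xml_text = ET.tostring(rel, encoding="unicode")
parsed = ET.fromstring(xml_text)  # round trip, as a viewer would read it
```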

A Graphical Relations Viewer

Creates a Java Relations object for each relation it reads from the XML file.
Inserts them into a Trie structure based on lower-cased first term.
– If there is already a Relation at that point, it adds the new one to a Vector for that term.
Creates an alphabetical list of all terms in a 2nd Trie.
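
The trie lookup can be illustrated with a short sketch. It is written in Python rather than the viewer's Java, and it simplifies the design: a single trie serves both for fetching a term's relations and for listing terms by prefix, where the viewer uses two. The class and method names are assumptions.

```python
class Trie:
    """Character trie keyed on lower-cased terms; each key holds a list of values."""

    def __init__(self):
        self.children = {}
        self.values = []  # relations stored at this exact key

    def insert(self, key, value):
        node = self
        for ch in key.lower():
            node = node.children.setdefault(ch, Trie())
        node.values.append(value)  # several relations may share a first term

    def get(self, key):
        node = self._walk(key.lower())
        return node.values if node is not None else []

    def keys_with_prefix(self, prefix):
        """Return all stored keys starting with prefix, in alphabetical order."""
        prefix = prefix.lower()
        node = self._walk(prefix)
        found = []
        if node is not None:
            node._collect(prefix, found)
        return found

    def _walk(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def _collect(self, prefix, found):
        if self.values:
            found.append(prefix)
        for ch in sorted(self.children):
            self.children[ch]._collect(prefix + ch, found)

# Illustrative relations only, stored as (term, weight, term) tuples.
relations = Trie()
relations.insert("Ciprofloxacin", ("ciprofloxacin", 1.4, "dna gyrase"))
relations.insert("ciprofloxacin", ("ciprofloxacin", 0.9, "quinolone"))
relations.insert("Cisplatin", ("cisplatin", 0.7, "tumor"))
matches = relations.keys_with_prefix("Ci")
```

Typing a fragment corresponds to `keys_with_prefix`, and clicking a term corresponds to `get`, which returns every relation stored under that term.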

Using the Viewer

When you enter part of a term, it shows all terms starting with that fragment in the left list box.

When you click on a term, it shows all its relations in the right list box.

Lexical Navigation

Displays relations between terms graphically and allows you to explore them without formulating a specific query.

Possible enhancements

Show only terms belonging to an ontology.
Show only higher IQ terms
Show the documents the relations occur in.
Show the ontology reference.
Show computed paths
Show more kinds of named relations.
– Inhibits, expresses

Evaluations of Information Visualization

Few, if any, graphical displays have been evaluated thus far for effectiveness.
Usability studies are hard to construct and carry out.
Intuition suggests that
– Exploration may result in discoveries.
– Relations more than one step apart seem best displayed graphically.
It remains to be shown that such visualizations are actually useful.

Differences in Intent

Displays may represent information your system has discovered.
– Gene-protein relations
Or they may represent data from which the user may discover new information.
– New 2nd or 3rd order relationships
These are rather different applications of visualization technology.

Summary

Java-based text mining system
Database of terms and positions
Computation of relations
Export as XML
Graphical relations viewer
The value of such visual interfaces has not yet been established.

Acknowledgements

Bhavani Iyer – XML export
Eric Brown – DictMatcher hash code
Daniel Tunkelang – graphical layout
Bob Mack – paper suggestions