Date post: 18-Dec-2015

Information Discovery on Vertical Domains

Vagelis Hristidis
Assistant Professor
School of Computing and Information Sciences
Florida International University (FIU), Miami

Need for Information Discovery
Amount of available data increases
Needle in the haystack problem
Some applications:
◦ Web
◦ Desktop search
◦ Data Warehousing
◦ Bibliographic database
◦ Homes, cars search, e.g., realtor.com, autotrader.com
◦ Scientific domains, e.g., genes, proteins, publications in biology; elements and interactions of components in chemistry
◦ Patient hospitalizations, physician info, procedure outcomes in hospitals

Strengths and Limitations of Current Approaches

Web Search
+ Scalability
+ Handle free text
+ Exploit content and link structure to achieve ranking
+ Simple keyword queries
- Limited query expressive power
- Generic, domain-independent ranking algorithms
- Return pages, not answers

Database Querying
+ Efficient
+ Handle structured data
+ Well-defined theory and answers
- Must learn query language, e.g., SQL
- No automatic ranking of results

Keyword Search in Databases
+ Simple keyword queries
+ Exploit links (e.g., primary-foreign keys)
- Generic ranking, typically size of result
- No domain semantics

[Figure: example database graph. Nodes:]
p1: person[name="John", nation="US"]
l1: lineitem[quantity=10, shipdate=Oct 14 2001]
l2: lineitem[quantity=10, shipdate=Oct 15 2001]
pa3: part[partkey=1005, name="TV"]
pa1: part[partkey=1008, name="VCR"]
pa2: part[partkey=1009, name="VCR & DVD"]

Research Objective
Allow effective and efficient information discovery on vertical domains
Strategy:
◦ Exploit associations between entities
◦ Model domain semantics, e.g., patient entity is critical for medical practitioner, but not for biologist
◦ Model users of a domain
◦ Use knowledge of domain experts, and existing knowledge structures (e.g., domain ontologies)
◦ Exploit user feedback
◦ Go beyond plain keyword search. Explore best search interface for each domain, e.g., faceted search

Specific Domains Studied (or being studied)
Products marketplace
Biological databases
Clinical databases
Bibliographic
Patents


Products Marketplace
Project started while visiting Microsoft Research at Redmond, in Summer 2003
SQL Returns Unordered Sets of Results
Overwhelms Users of Information Discovery Applications
How Can Ranking be Introduced, Given that ALL Results Satisfy Query?

Products Marketplace (cont’d)
Example: Realtor Database
House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too Many Results!
Intuitively, Houses with lower Price, more Bedrooms, or BoatDock are generally preferable

Note (vagelis): All results have the same values for the query-specified attributes, since we focus on conjunctive queries, as explained later.

Products Marketplace (cont’d)
Rank According to Unspecified Attributes [VLDB’04, TODS’06]
Score of a Result Tuple t depends on:
Global Score: Global Importance of Unspecified Attribute Values
◦ E.g., Newer Houses are generally preferred
Conditional Score: Correlations between Specified and Unspecified Attribute Values
◦ E.g., Waterfront → BoatDock
◦ E.g., Many Bedrooms → Good School District

Note (vagelis): So since all result tuples have identical query-specified attribute values...as introduced in MS CIDR03

Products Marketplace (cont’d)
Key Problems
Given a Query Q, How to Combine the Global and Conditional Scores into a Ranking Function.
◦ Use Probabilistic Information Retrieval (PIR).
How to Calculate the Global and Conditional Scores.
◦ Use Query Workload and Data.

Note (vagelis): In particular, I will show that the global and conditional parts of the ranking naturally appear by adapting PIR ranking techniques to our problem.
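To make the idea of a global and a conditional part concrete, here is a minimal Python sketch. It is an illustration only, not the ranking function of [VLDB’04, TODS’06]: it estimates both parts from a query workload alone (the slides also mention the data, which is left out for brevity), multiplies them, and uses add-one smoothing. The function names, the smoothing, and the toy workload are all assumptions.

```python
def request_rate(workload, target, given=None):
    """Add-one smoothed fraction of past queries requesting `target`,
    an (attribute, value) pair, among those that also request `given`."""
    pool = [q for q in workload if given is None or q.get(given[0]) == given[1]]
    hits = sum(q.get(target[0]) == target[1] for q in pool)
    return (hits + 1) / (len(pool) + 2)


def score(t, specified, workload):
    """Score result tuple t of a conjunctive query with conditions `specified`."""
    s = 1.0
    for y in t.items():
        if y[0] in specified:
            continue  # identical for all results, so it cannot affect the order
        s *= request_rate(workload, y)  # global part: how often y is requested
        for x in specified.items():
            s *= request_rate(workload, x, given=y)  # conditional part: x requested together with y
    return s


def rank(results, specified, workload):
    """Order result tuples by decreasing score."""
    return sorted(results, key=lambda t: score(t, specified, workload), reverse=True)


# Hypothetical past queries (attribute -> requested value) and result tuples.
workload = [
    {"Waterfront": True, "BoatDock": True},
    {"Waterfront": True, "BoatDock": True},
    {"City": "Seattle"},
    {"BoatDock": True},
]
query = {"City": "Seattle", "Waterfront": True}
no_dock = {"City": "Seattle", "Waterfront": True, "BoatDock": False}
with_dock = {"City": "Seattle", "Waterfront": True, "BoatDock": True}
ranked = rank([no_dock, with_dock], query, workload)
```

In this toy workload, users who ask for Waterfront often ask for BoatDock too, so the house with a boat dock is ranked first even though both houses satisfy the query.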

Products Marketplace (cont’d)
Other Projects
Select the best attributes to output: attribute ordering problem [SIGMOD’06]
◦ E.g., Color is important for sports cars but not much for family cars
Product Advertising: Select best attributes to display for a product to maximize its visibility among its competitors [ICDE’08, TKDE’09]
◦ Use past query workload
◦ Maximize number of past queries for which the product is returned
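The product advertising objective can be sketched as a small search problem. The Python below is a brute-force illustration under assumed simplifications: attributes are Boolean, each past query is a set of required attributes, and a query returns the product only if every attribute it requires is displayed. The function name and the example car are hypothetical, and the cited papers may use a different model and algorithm.

```python
from itertools import combinations


def best_attributes(product_attrs, workload, m):
    """Pick the m attributes to display that satisfy the most past queries.

    Exhaustive search over all subsets of size m; only suitable for small inputs.
    """
    best, best_hits = (), -1
    for subset in combinations(sorted(product_attrs), m):
        shown = set(subset)
        # A query is satisfied when all attributes it requires are displayed.
        hits = sum(q <= shown for q in workload)
        if hits > best_hits:
            best, best_hits = subset, hits
    return set(best), best_hits


# Hypothetical car and past queries.
car = {"AC", "Turbo", "FourDoor", "PowerBrakes"}
workload = [{"AC", "FourDoor"}, {"AC"}, {"Turbo"}, {"FourDoor"}, {"PowerBrakes", "Turbo"}]
chosen, visibility = best_attributes(car, workload, m=2)
```

Here displaying AC and FourDoor makes the car visible to three of the five past queries, more than any other pair.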


Biological Databases [EDBT’09]
With University of Maryland
Intuitive but powerful query language, based on soft (ranking) and hard (pruning) filters
Goal is to improve the experience of PubMed users
Exploit associations between entities (genes, proteins, publications)
Example of Query: Find the most important publications on “cancer” that are related to the “TNF” gene through a protein.
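The difference between hard and soft filters can be illustrated on the example query. This Python sketch is not the query language of [EDBT’09]. It assumes the gene-to-protein-to-publication path is a hard filter, that the keyword match and a citation count act as soft filters, and that the data lives in plain dictionaries. All names and records are made up.

```python
def find_publications(publications, gene_to_proteins, protein_to_pubs, gene, keyword):
    """Return publication ids linked to `gene` through a protein, best first."""
    # Hard filter (pruning): keep only publications on a gene -> protein -> publication path.
    reachable = {
        pub
        for protein in gene_to_proteins.get(gene, ())
        for pub in protein_to_pubs.get(protein, ())
    }

    # Soft filter (ranking): prefer keyword matches, then higher citation counts.
    def soft_key(pub_id):
        pub = publications[pub_id]
        return (keyword.lower() in pub["title"].lower(), pub["citations"])

    return sorted(reachable, key=soft_key, reverse=True)


# Hypothetical toy data.
publications = {
    "p1": {"title": "TNF signaling in cancer", "citations": 120},
    "p2": {"title": "Protein folding basics", "citations": 300},
    "p3": {"title": "Cancer genomics survey", "citations": 500},
}
gene_to_proteins = {"TNF": ["TNF-alpha"]}
protein_to_pubs = {"TNF-alpha": ["p1", "p2"], "p53": ["p3"]}
answer = find_publications(publications, gene_to_proteins, protein_to_pubs, "TNF", "cancer")
```

Publication p3 is pruned despite its citation count because no protein links it to the gene, while p2 survives the hard filter and is merely ranked below p1.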

Results Navigation in PubMed with BioNav [ICDE’09, TKDE’10]
With SUNY Buffalo.
Most publications in PubMed are annotated with Medical Subject Headings (MeSH) terms.
Present results in MeSH tree.
Propose navigation model and smart expansion techniques that may skip tree levels.

BioNav: Exploring PubMed Results
Static Navigation Tree for query “prothymosin”
[Figure: static navigation tree. Concepts shown, with result counts: MESH (313); Amino Acids, Peptides, and Proteins (310); Proteins (307); Nucleoproteins (40); Histones (15); Biological Phenomena, … (217); Cell Physiology (161); Cell Growth Processes (99); Genetic Processes (193); Gene Expression (92); Transcription, Genetic (25). Collapsed placeholders read 95, 2, 45, 4, 3, 15, 10, and 1 more node(s).]
- Query Keyword: prothymosin
- Number of results: 313
- Navigation Tree stats:
• # of nodes: 3941
• depth: 10
• total citations: 30897
Big tree with many duplicates!

Vagelis Hristidis, Searching and Exploring Biomedical Data

BioNav: Exploring PubMed Results
Reveal to the user a selected set of descendant concepts that:
(a) Collectively contain all results
(b) Minimize the expected user navigation cost
Not all children of the root are necessarily revealed, unlike in static navigation.
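Condition (a) can be checked mechanically. The Python sketch below assumes the concept hierarchy is a dictionary from a concept to its children and that citations are attached to concepts as sets of ids. The tiny hierarchy and the function names are illustrative, and the cost minimization of condition (b) is not attempted.

```python
def subtree_results(tree, attached, concept):
    """All citations attached to `concept` or to any of its descendants."""
    found = set(attached.get(concept, ()))
    for child in tree.get(concept, ()):
        found |= subtree_results(tree, attached, child)
    return found


def covers_all(tree, attached, root, revealed):
    """True if the revealed concepts collectively contain every result under root."""
    covered = set()
    for concept in revealed:
        covered |= subtree_results(tree, attached, concept)
    return covered == subtree_results(tree, attached, root)


# Illustrative hierarchy; citation 2 appears under two concepts (a duplicate).
tree = {
    "MESH": ["Proteins", "Genetic Processes"],
    "Proteins": ["Nucleoproteins"],
    "Nucleoproteins": ["Histones"],
    "Genetic Processes": ["Gene Expression"],
}
attached = {"Histones": {1, 2}, "Nucleoproteins": {3}, "Gene Expression": {2, 4}}

skips_a_level = covers_all(tree, attached, "MESH", ["Nucleoproteins", "Gene Expression"])
misses_one = covers_all(tree, attached, "MESH", ["Histones", "Gene Expression"])
```

The first reveal set skips the children of the root yet still contains all four citations; the second misses citation 3 and so would not be a valid expansion.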

BioNav Evaluation
[Figure: bar chart of Overall Navigation Cost (# of Concepts Revealed + # of EXPAND Actions), Static vs. BioNav; y-axis from 0 to 20.]


XOntoRank: Use Ontologies to Search Electronic Medical Records [ICDE’09]
With Miami Children’s Hospital, Indiana University School of Medicine, IBM Almaden.
Latest EMR format: HL7 CDA (XML-based)
Algorithm to enhance keyword search using ontological knowledge (e.g., SNOMED)

Medical Dictionary
[Figure: ontology fragment. Concepts: 50043002 Disorder of Respiratory system; 79688008 Respiratory Obstruction; 118946009 Disorder of Thorax; 41427001 Disorder of Bronchus; 195967001 Asthma; 301229001 Bronchial Finding; 405944004 Asthmatic Bronchitis; 266364000 Asthma attack; 955009 Bronchial Structure; 82094008 Lower respiratory tract structure. Edge labels: “Is a”, “May be”, “Finding site of”.]

SAMPLE CDA FRAGMENT


XOntoRank: Example 1

q = {“bronchitis”, “albuterol”}

result = Observation (code, value Bronchitis, value Albuterol)

XOntoRank: Example 2

q = {“asthma”, “albuterol”}

result = ???


XOntoRank
A CDA node may be associated with a query keyword w through the ontology.
XOntoRank first assigns scores to ontological concepts
◦ OntoScore OS(): Semantic relevance of a concept c in the ontology to a query keyword w.
Then, given these scores, assign Node Scores NS() to document nodes
Other aggregation functions are possible.

Computing OntoScore of Concept Given Query Keyword
Three ways to view the ontology graph:
◦ As an unlabeled, undirected graph.
◦ As a taxonomy.
◦ As a complete set of relationships.
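For the first view (an unlabeled, undirected graph), one simple way to turn graph distance into a relevance score is sketched below in Python. This is an assumed scheme, not the OntoScore definition from the XOntoRank paper: concepts matching the keyword get score 1, and the score is multiplied by a decay factor for every hop away from them. The decay value, the function name, and the edges are illustrative.

```python
from collections import deque


def onto_scores(edges, seeds, decay=0.5):
    """Score each concept as decay**d, d being its hop distance to the nearest seed.

    Edge labels and directions are ignored (undirected, unlabeled view).
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    scores = {seed: 1.0 for seed in seeds}
    queue = deque(seeds)
    while queue:  # breadth-first search, so the first visit is the shortest path
        concept = queue.popleft()
        for neighbor in graph.get(concept, ()):
            if neighbor not in scores:
                scores[neighbor] = scores[concept] * decay
                queue.append(neighbor)
    return scores


# Illustrative edges loosely named after the concepts in the Medical Dictionary figure.
edges = [
    ("Asthma", "Disorder of Bronchus"),
    ("Asthmatic Bronchitis", "Asthma"),
    ("Disorder of Bronchus", "Bronchial Structure"),
]
scores = onto_scores(edges, seeds=["Asthma"])
```

Under this scheme a document node that mentions only “Asthmatic Bronchitis” still receives a nonzero score for the keyword “asthma”, which is the situation raised by Example 2.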

Authority Flow Ranking in EMRs
A subset of the electronic health record dataset.
Work under submission.
[Figure: data graph with nodes v1 to v7, connected by edges labeled prescribed_to and recorded_by. Node contents:]
EventsPlan: TimeStampCreated="2004-11-03 11:57:00.0", Events="….small residual pericardial effusion….."
Hospitalization: TimeStampCreated="2004-10-27 22:00:00.0", History="18 year old boy with an aggressive form of chest lymphoma…", Allergies="NKDA"…...
Cardiac: PatientID="1438", Complication="apical impulse … Echo-large increasing pericardial effusion…"
Employee: TimeStampCreated="2004-12-23 14:03:00.0", Title="Pediatric Cardiologist"….
EventsPlan: Events="4 month old baby… pericardial effusion..."
Medication: TimeStampCreated="2003-02-13 21:57:00.0"..
Hospitalization: History="48 year old.."
Query: “pericardial effusion”

ObjectRank on EMRs: Authority Flow Ranking
Schema of the EMR dataset
[Figure: schema graph. Entities: Hospitalization, Employee, Associated_Events, Patient, Medication. Edge types: A-E, P-M, H-M, M-E, A-H, H-E, P-E. Edge labels: created_by, recorded_by, prescribed_by, of, prescribed_to, for.]
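Authority flow ranking can be pictured as a random walk that restarts at the objects matching the query (the base set) and follows weighted edges. The Python sketch below is a generic personalized-PageRank-style power iteration, not the exact ObjectRank formulation. The per-edge weights, the damping factor of 0.85, the fixed iteration count, and the tiny graph are assumptions chosen for illustration.

```python
def authority_flow(edges, base_set, d=0.85, iters=50):
    """Power iteration over weighted edges with restarts at the base set.

    edges: list of (source, target, weight), where weight is the share of the
    source's authority passed along that edge.
    """
    nodes = {n for s, t, _ in edges for n in (s, t)} | set(base_set)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    restart = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) * restart[n] for n in nodes}
        for s, t, w in edges:
            nxt[t] += d * w * rank[s]  # authority flowing along the edge
        rank = nxt
    return rank


# Hypothetical EMR objects; "events1" is assumed to contain the query keywords.
edges = [
    ("events1", "hosp", 0.5),
    ("hosp", "events1", 0.5),
    ("hosp", "med", 0.3),
    ("other", "med", 0.3),
]
rank = authority_flow(edges, base_set=["events1"])
```

Objects that do not contain the keywords, such as "hosp" and "med" here, still receive a score through their links to the base set, while an object that no authority can reach ends up with zero.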

User Study


Explaining Subgraph


User Study Results
[Figure: two bar charts, Mean Sensitivity and Mean Specificity (y-axes: Average Sensitivity and Average Specificity, 0 to 1), comparing CO085BM25, BM25, CO085, and CO030.]
BM25: Traditional Information Retrieval Ranking Function
CO: Clinical ObjectRank (Authority Flow)

Other challenges of Searching EMRs [NSF Symposium on Next Generation of Data Mining ’07]
Entity and Association Semantics
Negative Statements
Personalization
Treatment of Time and Location Attributes
Free Text Embedded in CDA Document

Syntax vs. Semantics in Schema


Example – query “Asthma Theophylline”

More details at [Hristidis et al. NSF Symposium on Next Generation of Data Mining ’07]


Bibliographic Databases
Work started while at UCSD
Exploit citation link structure to create query-specific ranking [VLDB’04, TODS’08]
Demo available for Database literature at http://dbir.cs.fiu.edu/BibObjectRank

Bibliographic Databases (cont’d)
Query Reformulation
Work with U of Maryland [ICDE’08]
Based on user selected results
Perform query expansion: add/change weight of query keywords
Adjust authority flow weights
Currently working on applying these ideas to queries on PubMed.

Explaining Query Results: Explaining Subgraph
Target Object: “Modeling Multidimensional Databases” paper.
Explaining Subgraph Creation
1. BFS in reverse direction from target object.
2. BFS in forward direction from base set objects (authority sources).
3. Subgraph contains all nodes/edges traversed in forward direction.
4. Compute explaining authority flow along each edge by eliminating the authority leaving the subgraph (iterative procedure).
5. Structure-based reformulation: High-flow edges in explaining subgraph receive weight boost.
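Steps 1 to 3 can be sketched in Python as two breadth-first searches. The sketch assumes that the forward search of step 2 is restricted to the nodes found by the reverse search of step 1, which is one reading of the slide rather than a confirmed detail. Steps 4 and 5 (the iterative flow adjustment and the weight boost) are not covered, and the graph and names are made up.

```python
from collections import deque


def reachable(starts, neighbors):
    """Breadth-first search: all nodes reachable from `starts` via `neighbors`."""
    seen, queue = set(starts), deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def explaining_subgraph(edges, base_set, target):
    fwd, rev = {}, {}
    for s, t in edges:
        fwd.setdefault(s, []).append(t)
        rev.setdefault(t, []).append(s)

    # Step 1: reverse BFS from the target finds every node that can reach it.
    can_reach_target = reachable([target], rev)

    # Step 2: forward BFS from the base set, staying inside that node set (assumption).
    starts = [b for b in base_set if b in can_reach_target]
    restricted = {s: [t for t in ts if t in can_reach_target] for s, ts in fwd.items()}
    nodes = reachable(starts, restricted)

    # Step 3: keep the nodes and edges traversed in the forward direction.
    sub_edges = [(s, t) for s, t in edges if s in nodes and t in nodes]
    return nodes, sub_edges


# Hypothetical authority transfer graph.
edges = [("b1", "p1"), ("p1", "target"), ("b1", "x"), ("x", "y"),
         ("b2", "target"), ("z", "target")]
nodes, sub_edges = explaining_subgraph(edges, base_set=["b1", "b2"], target="target")
```

Nodes x and y are dropped because they cannot reach the target, and z is dropped because no base set object leads to it.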

[Figure: explaining subgraph with nodes v1 to v6; the “Modeling Multidimensional Databases” paper is marked TARGET OBJECT. Node contents:]
Paper: Authors="H. Gupta, V. Harinarayan, A. Rajaraman, J. Ullman", Title="Index Selection for OLAP.", Year="ICDE 1997"
Paper: Authors="C. Ho, R. Agrawal, N. Megiddo, R. Srikant", Title="Range Queries in OLAP Data Cubes.", Year="SIGMOD 1997"
Paper: Authors="R. Agrawal, A. Gupta, S. Sarawagi", Title="Modeling Multidimensional Databases.", Year="ICDE 1997"
Author: Name="R. Agrawal"
Year: Name="ICDE", Year=1997, Location=Birmingham
Conference: Name="ICDE"
Edge authority flows shown: 1.59e-7, 6.76e-6, 1.48e-4, 7.12e-6, 2.37e-6, 3.02e-4, 1.0e-4, 0.001, 6.76e-6, 7.12e-7, 9.55e-7


Search Patents
Special characteristics of patents:
◦ Patents are organized into classes and subclasses.
◦ Patents have links to external publications and to other patents.
◦ Patents are organized into various sections (abstract, claims, description and images).
◦ Patents use specific legal wording in the claims section. Further, claims have references to other claims, that is, claims can be viewed as a graph.
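The remark that claims form a graph can be made concrete with a few lines of Python. The sketch assumes claims refer to each other with phrases such as "of claim 1". The regular expression, the function name, and the sample claims are illustrative, and ranges such as "claims 1 to 3" are not handled.

```python
import re

# Matches "claim 1" or "claims 1"; only the first number of a list is captured.
CLAIM_REF = re.compile(r"\bclaims?\s+(\d+)", re.IGNORECASE)


def claim_graph(claims):
    """Map each claim number to the sorted list of claim numbers it refers to."""
    return {
        number: sorted({int(ref) for ref in CLAIM_REF.findall(text)} - {number})
        for number, text in claims.items()
    }


# Hypothetical claims section.
claims = {
    1: "A method for ranking query results comprising computing a score.",
    2: "The method of claim 1, wherein the score uses a query workload.",
    3: "The method of claim 2, further comprising smoothing the score.",
    4: "A system configured to perform the method of claim 1.",
}
graph = claim_graph(claims)
```

The resulting adjacency lists show claim 1 as an independent claim with claims 2 and 4 depending on it directly and claim 3 depending on it through claim 2.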

Demo at PatentsSearcher.com

End - Thank You
For more information, please go to: http://ww.cis.fiu.edu/~vagelis
Supported by
◦ NSF CAREER, 2010-2015
◦ NSF grant IIS-0811922: III-CXT-Small: Information Discovery on Domain Data Graphs, 2008-2011
◦ DHS grant 2009-ST-062-000016: Information Delivery and Knowledge Discovery for Hurricane Disaster Management, 2009-2011

Extra Slides


CDA Document – Tree View

