Date post: 19-Dec-2015
Technology for integrated access and discovery
Presented by: Marc Krellenstein
Title: VP, Search and Discovery
Advanced Technology Group
Date: February 5, 2004
Transcript
Basic search is pretty good

• Modern search engines are fast and scalable
  – Having the data (usually lots) is still key
• Can interpret keyword, Boolean and pseudo-natural language queries
  – Ex: “how to make an international call with my Blackberry”
• Spell checking, thesauri and stemming to improve recall
• Users are more experienced
  – More multi-term searches
• Gets lots of hits, but that’s usually OK if good ones on top

Basic search is pretty good

• Best practice relevancy ranking is good:
  – Term frequency (TF): more hits count more
  – Inverse document frequency (IDF): hits of rarer search terms count more
    • Ex: diabetes diagnosis and treatment
  – Hits of search terms near each other count more
    • Ex: penicillin allergy vs. “penicillin allergy”
  – Hits on metadata (title, subject, etc.) count more
    • Use anchor text – referring text – as metadata
  – Items with more links/references to them count more
    • Authoritative links/referrers count yet more
  – Many other factors: length, date, etc.
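
The TF and IDF factors listed above can be sketched in a few lines of Python. This is a minimal illustration, not the ranking formula of any particular engine: the whitespace tokenizer, the smoothed IDF formula and the three sample documents are all assumptions made for the example, and proximity, metadata and link factors are left out.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a real engine would also stem)."""
    return text.lower().split()

def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of TF * IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in tokenize(query):
            if df[term]:
                # Smoothed IDF (an assumed variant): rarer terms weigh more.
                idf = math.log(1 + n_docs / df[term])
                score += tf[term] * idf
        scores.append(score)
    return scores

# Invented sample documents.
docs = [
    "diabetes diagnosis and treatment options",
    "treatment of seasonal allergy and treatment follow-up",
    "diabetes and diet",
]
scores = tf_idf_scores("diabetes diagnosis", docs)
# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here "diagnosis" appears in only one document, so a hit on it counts for more than a hit on "diabetes", which appears in two.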

Basic search is pretty good

• Using these techniques search engines can locate specific documents, or good documents (if not the absolute best) around general or specific topics

• But challenges remain…

Current challenges

• Integrated search: Content still exists in separate silos
  – Silos getting bigger but there are still too many
  – Library patrons have dozens of choices
  – Putting even more into Google is probably not sufficient to solve the problem
• Finding the best/novel documents
  – Hard to perform complicated searches (e.g., research similar to one’s own)
    • Historians can’t define a profile…
• Discovery
  – Hard to do more than search: summarize, uncover novelty and relationships, analyze

The integration challenge

• Two approaches:
  – Build even bigger databases (well, yes…)
    • Not easy, but sometimes the easiest approach
    • Can be difficult to manage and secure appropriate rights
  – Distributed search: Search separately managed (or owned) large databases as if they are one
    • Technically more challenging, but a scalable and maintainable architecture

Distributed search

• Index multiple (maybe geographically) separate databases with a single search engine that supports distributed search
  – Use common metadata scheme (e.g., Dublin Core) and/or determine other common fields or field mappings for each database
  – Search engine provides parallel search, integrated ranking and integrated results
  – The separate databases can be maintained and updated separately
  – Elsevier is currently unifying its own sources in such a model with a ‘web service’ architecture
    • Has contributed specifications to the public domain
  – Such services can also be offered externally
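
As a rough sketch of the distributed model described above, the following Python maps each database's fields onto a shared schema, searches the databases in parallel and merges the hits into one ranked list. The database names, field mappings, records and the simple term-count score are all hypothetical; this is not Elsevier's architecture, only an illustration of the idea.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-database field mappings onto a shared, Dublin Core style schema.
FIELD_MAPS = {
    "journals": {"article_title": "title", "authors": "creator"},
    "books": {"name": "title", "writer": "creator"},
}

# Invented stand-ins for separately maintained databases.
DATABASES = {
    "journals": [{"article_title": "p53 and apoptosis", "authors": "Lee"}],
    "books": [
        {"name": "Apoptosis handbook", "writer": "Kim"},
        {"name": "Search engines", "writer": "Roe"},
    ],
}

def normalize(db_name, record):
    """Rename a record's native fields to the common schema."""
    mapping = FIELD_MAPS[db_name]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def search_one(db_name, query):
    """Score one database's records by query-term counts in the title."""
    hits = []
    for record in DATABASES[db_name]:
        doc = normalize(db_name, record)
        words = doc["title"].lower().split()
        score = sum(words.count(term) for term in query.lower().split())
        if score:
            hits.append((score, db_name, doc))
    return hits

def distributed_search(query):
    """Search every database in parallel, then rank all hits together."""
    with ThreadPoolExecutor() as pool:
        per_db = list(pool.map(lambda name: search_one(name, query), DATABASES))
    merged = [hit for hits in per_db for hit in hits]
    return sorted(merged, key=lambda hit: hit[0], reverse=True)

results = distributed_search("apoptosis")
```

Because every database is scored by the same function on the same normalized fields, the scores are directly comparable, which is what makes integrated ranking straightforward in this model.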

Distributed search

• Simplifies some business issues, but still requires common technology platform
• Where common platform not possible, add federated search (i.e., metasearch)
  – Translate queries
  – Access and perform parallel search of multiple search engines (vs. multiple databases)
  – Integrate results as best as possible
  – Use standards to approximate distributed search
    • Uniform access, one query language (Z39.50, updated)
    • Add standards for relevancy ranking and results return?
    • NISO and its members are working on standards
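
The federated (metasearch) steps above might look like the sketch below: translate one query into each engine's syntax, collect each engine's own ranked list, and interleave the lists. The two engines, their query syntaxes and the returned document ids are invented stubs, and round-robin interleaving is only one simple way to "integrate results as best as possible" when scores from different engines are not comparable. The engines are called one after another here for brevity; a real metasearch layer would call them in parallel.

```python
from itertools import zip_longest

def to_boolean(terms):
    """Translate to a hypothetical 'a AND b' syntax."""
    return " AND ".join(terms)

def to_plus_syntax(terms):
    """Translate to a hypothetical '+a +b' syntax."""
    return " ".join("+" + term for term in terms)

def engine_a(native_query):
    """Stub for a remote engine that only understands Boolean AND queries."""
    if " AND " not in native_query:
        raise ValueError("engine_a expects 'a AND b' syntax")
    return ["doc-3", "doc-1", "doc-7"]  # its own ranking, best first

def engine_b(native_query):
    """Stub for a remote engine that only understands '+term' queries."""
    if not native_query.startswith("+"):
        raise ValueError("engine_b expects '+a +b' syntax")
    return ["doc-1", "doc-9"]

ENGINES = [(to_boolean, engine_a), (to_plus_syntax, engine_b)]

def federated_search(terms):
    """Query every engine in its own syntax and interleave the ranked lists."""
    ranked_lists = [search(translate(terms)) for translate, search in ENGINES]
    merged, seen = [], set()
    # Round-robin: rank 1 from each engine, then rank 2, and so on.
    for tier in zip_longest(*ranked_lists):
        for doc_id in tier:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

merged = federated_search(["penicillin", "allergy"])
```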

Finding the best: Navigation

• More data can also make finding the best or novel documents harder
  – For searches for rare items, more data is a win
  – For all other searches, it’s more likely your answer is in there…but it’s also more likely there’s lots of other stuff close but not as good
• Why? Relevancy is good but…
• Relevancy has its limits…there may be many ‘good’ documents referring to different aspects of the search…the best?
• Underlying problems:
  – User’s needs may not be that specific
  – Even long searches are under-specified

One solution: clustering documents

• Group results around common themes: same subject, author, web site, journal,…
• Show largest/most interesting categories
  • Depression → psychology, economics, meteorology, antiques…
    – Psychology → treatment of depression, depression symptoms, seasonal affective…
    – Psychology → Kocsis, J. (10), Berg, R. (8), …
• Themes could come from static metadata or dynamically by analysis of results text
  – Static: fixed, clear categories and assignments
  – Dynamic: doesn’t require metadata (or controlled vocabulary to draw from)
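
The static-metadata variant of clustering can be illustrated with a short Python sketch that groups a result list by one metadata field and then drills into a cluster by another. The result records and the field names `subject` and `author` are invented for the example; dynamic clustering from the result text would need a different technique and is not shown.

```python
from collections import defaultdict

# Invented results for a search on "depression".
results = [
    {"title": "Treating depression", "subject": "psychology", "author": "Kocsis, J."},
    {"title": "Seasonal affective disorder", "subject": "psychology", "author": "Berg, R."},
    {"title": "Depression symptoms", "subject": "psychology", "author": "Kocsis, J."},
    {"title": "The Great Depression", "subject": "economics", "author": "Doe, A."},
    {"title": "Tropical depressions", "subject": "meteorology", "author": "Roe, B."},
]

def cluster_by(items, field):
    """Group items by a metadata field, largest cluster first."""
    clusters = defaultdict(list)
    for item in items:
        clusters[item[field]].append(item)
    return sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)

# Top level: which subjects do the hits fall into, and how many in each?
top_level = [(name, len(items)) for name, items in cluster_by(results, "subject")]

# Drill down: within psychology, cluster again by author.
psychology = dict(cluster_by(results, "subject"))["psychology"]
by_author = [(name, len(items)) for name, items in cluster_by(psychology, "author")]
```

The same function serves every level of the hierarchy, so a user can narrow by subject first and by author second, or the other way round, after the search has run.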

Clustering benefits

• Disambiguates and refines search results to get to documents of interest quickly
• Can navigate long result lists hierarchically
  – Would never offer thousands of choices to choose from as input…
  – Access to bottom of list…maybe just less common
• Discovery – new aspects or sources
• Can narrow results *after* search
  – Start with the broadest area search – don’t narrow by subject or other categories first
  – Easier, plus can’t guess wrong, miss useful, or pick unneeded, categories…results-driven
    • Knee surgery → cartilage replacement, plastics, …

Finding the best: Complex search

• Main problem is still short searches/under-specification…which the keyword-based ‘enter a query’ paradigm encourages
• One solution: Relevance feedback – marking good and bad results
• A long-standing and proven search refinement technique
  – More information is better than less (longer queries are better)
  – Pseudo-relevance feedback is a research standard
• Commercial forms (find-similar, etc.) not widely used (or well executed)...
• …but successful in PubMed (different users)
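
One classic formulation of relevance feedback is the Rocchio update, sketched below in Python. The slides do not name a specific method, so treating Rocchio as the technique is an assumption, as are the weights `alpha`, `beta` and `gamma` (commonly quoted defaults) and the sample texts. The idea is to move the query towards the documents marked good and away from those marked bad.

```python
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def rocchio(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted, expanded query as a {term: weight} dict."""
    new_query = Counter()
    for term, weight in vectorize(query).items():
        new_query[term] += alpha * weight
    for doc in good_docs:
        # Add the average term weights of documents marked relevant.
        for term, weight in vectorize(doc).items():
            new_query[term] += beta * weight / len(good_docs)
    for doc in bad_docs:
        # Subtract the average term weights of documents marked non-relevant.
        for term, weight in vectorize(doc).items():
            new_query[term] -= gamma * weight / len(bad_docs)
    # Keep only terms that still carry positive weight.
    return {term: w for term, w in new_query.items() if w > 0}

expanded = rocchio(
    "knee surgery",
    good_docs=["knee cartilage replacement surgery"],
    bad_docs=["plastic surgery clinic"],
)
```

The expanded query is longer than the original two words, which is the point made above: more information is better than less. Pseudo-relevance feedback applies the same update but simply assumes the top-ranked results are the good ones.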

Relevance feedback

• One catch: Must first find a good document to be similar to
• Solution: Let the user provide the ideal document – or a long query or problem statement – as input in the first place
  – Can enter free text or specific documents describing the interest, e.g., article, grant proposal, experiment description, etc.
  – Should provide the best possible matches
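
Searching with a whole document or problem statement as the query can be sketched as a similarity ranking. The example below uses cosine similarity over raw term counts, which is an assumed choice rather than anything the slides specify; the corpus and the problem statement are invented, and a production system would add term weighting and stop-word handling.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_similar(problem_statement, corpus):
    """Rank corpus documents by similarity to a long free-text input."""
    query = term_vector(problem_statement)
    scored = [(cosine(query, term_vector(doc)), doc) for doc in corpus]
    return sorted(scored, reverse=True)

# Invented corpus and problem statement.
corpus = [
    "p53 mutations in human tumours",
    "distributed search architectures for libraries",
    "role of p53 in apoptosis and tumour suppression",
]
statement = "we propose to study how p53 mutations affect apoptosis in human cells"
best_score, best_doc = find_similar(statement, corpus)[0]
```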

Discovery challenge: Beyond search

• How do you summarize a corpus?
  – May want to report on what’s present, numbers of occurrences, trends, etc.
  – Ex: What diseases are studied the most?
  – Must know all diseases and look one by one
• How do you find a relationship if you don’t know what relationships exist?
  – Ex: does gene p53 relate to any disease?
  – Must check for each possible relationship
• Ad hoc analysis
  – How do all genes relate to this one disease? Over time? What organisms has the gene been studied in? Show me the document evidence…

One solution: entity extraction

• Identify entities (things) in a text corpus
  – Examples: authors, universities… diseases, drugs, side-effects, genes…companies, lawsuits, plaintiffs, defendants…
  – Use lexicons, patterns, NLP for finding any or all instances of the entity
• Identify relationships:
  – Through co-occurrence
    • Relationship presumed from proximity
    • Example: author-university affiliation
  – Through limited natural language processing
    • Semantic relations – causes, is-part-of, etc.
    • Examples: drug-causes-disease…drug-is-treatment-for-disease…a is suing b…
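
A minimal Python sketch of the lexicon-plus-co-occurrence approach follows. The two tiny lexicons and the sample text are invented, sentence splitting is done with a simple regular expression, and a relationship is presumed whenever a gene and a disease appear in the same sentence. It does not attempt the semantic relations (causes, is-part-of) that need natural language processing.

```python
import re
from itertools import product

# Toy lexicons, invented for illustration; real ones would be far larger.
LEXICONS = {
    "gene": {"p53", "brca1"},
    "disease": {"leukemia", "breast cancer"},
}

def extract_entities(sentence):
    """Find lexicon entries that occur as whole words in the sentence."""
    found = {entity_type: set() for entity_type in LEXICONS}
    lowered = sentence.lower()
    for entity_type, names in LEXICONS.items():
        for name in names:
            if re.search(r"\b" + re.escape(name) + r"\b", lowered):
                found[entity_type].add(name)
    return found

def cooccurrences(text):
    """Presume a gene-disease relationship from same-sentence proximity."""
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = extract_entities(sentence)
        pairs.update(product(found["gene"], found["disease"]))
    return pairs

text = (
    "Mutations in p53 are common in leukemia. "
    "BRCA1 is linked to breast cancer. "
    "Plastics are unrelated."
)
pairs = cooccurrences(text)
```

Co-occurrence is cheap but noisy: it would also pair a gene and a disease in a sentence stating that they are unrelated, which is why the limited NLP step is listed as a separate, stronger method.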

ClearForest pilot, Fall 2002

• Goal: Demonstrate real value to a working expert in 90 days
• Chose biomedical domain
• Hired expert to help define entities and relationships
• Used 25,000 abstracts from 23 Elsevier journals
• Worked with ClearForest to define and revise extraction of entities and relationships
• Have related partnership with Stanford for text mining

Pilot scenarios

• Answered real questions using real data – not a demo or mock-up
• The user:
  – anyone involved in genomic academic research: a primary researcher, graduate student or post-doc
• Scenario 1: Research about gene p53
  – What journals should I publish in?
  – Who’s an expert I can ask for advice?
  – What connections have been made to my gene?
  – What organisms have my gene?

What journals should I publish in?

Who’s an expert?

Connections to p53?

To organisms?

Pilot scenarios

• Scenario 2: Disease research
  – What diseases are most researched?
  – What’s the time trend in HIV research?
  – What are the centers of HIV research?
  – Who are the author teams in HIV?
  – What gene-disease relationships are there? What were they to start in 1996? through 1997?
  – (Note: Cannot answer the above with search alone)
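
Once entities have been extracted, questions like those in Scenario 2 become simple aggregations. The Python sketch below assumes a hypothetical extraction output with one record per abstract; the records are invented toy data, not results from the pilot.

```python
from collections import Counter

# Hypothetical output of an extraction step: one record per abstract.
records = [
    {"year": 1996, "diseases": ["HIV"]},
    {"year": 1996, "diseases": ["leukemia"]},
    {"year": 1997, "diseases": ["HIV"]},
    {"year": 1997, "diseases": ["HIV", "leukemia"]},
    {"year": 1997, "diseases": ["Alzheimer's"]},
]

# "What diseases are most researched?": count abstracts mentioning each disease.
most_researched = Counter(
    disease for record in records for disease in record["diseases"]
).most_common()

# "What's the time trend in HIV research?": HIV abstracts per year.
hiv_by_year = Counter(
    record["year"] for record in records if "HIV" in record["diseases"]
)
trend = sorted(hiv_by_year.items())
```

Neither question names the diseases or years in advance, which is why a keyword search alone cannot answer them: the list of values has to come from the corpus itself.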

What diseases are most researched?

Time trend in HIV research?

Centers of HIV research?

Author teams in HIV research?

Gene-disease relationships?

To start, in 1996?

Through 1997?

Pilot scenarios

• Scenario 3: Connections between leukemia and Alzheimer’s
  – Are there direct connections between leukemia and Alzheimer’s?
  – What enzymatic activity is associated with leukemia?
  – Are there indirect connections between leukemia and Alzheimer’s mediated by enzymatic activity?

Direct connections between leukemia and Alzheimer’s?

Enzymes associated with leukemia?

Indirect links from leukemia to Alzheimer’s via enzymes

The power of indirect links

• Almost impossible to determine manually
• Can provide completely unexpected relationships between source and target
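
Finding an indirect link amounts to a two-hop join over extracted relationships: collect what the source is linked to, collect what the target is linked to, and intersect. The sketch below assumes relationships are available as (entity, enzyme) pairs; the pairs and enzyme names are invented placeholders, not findings from the pilot.

```python
from collections import defaultdict

# Invented toy relationships of the form (entity, enzyme).
LINKS = [
    ("leukemia", "enzyme-A"),
    ("leukemia", "enzyme-B"),
    ("alzheimer's", "enzyme-B"),
    ("alzheimer's", "enzyme-C"),
]

def indirect_links(source, target, links):
    """Return the intermediates that connect source and target."""
    neighbours = defaultdict(set)
    for entity, intermediate in links:
        neighbours[entity].add(intermediate)
    # An intermediate linked to both ends forms a two-hop path.
    return sorted(neighbours[source] & neighbours[target])

shared = indirect_links("leukemia", "alzheimer's", LINKS)
```

Done by hand, the same check means reading every document about the source and every document about the target and comparing notes, which is why such links are almost impossible to determine manually at corpus scale.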

The value of analytics

• Goes beyond search – summarizes, shows relationships, answers complex questions
• A significant value-added service
  – Value of one new drug discovery?

Summary

• Need to search more broadly, more easily
  – Larger databases
  – Distributed search
• Need to locate best/novel documents in even larger (distributed) databases
  – Clustering to find documents of real interest
  – Find/similar, descriptive search
• Need to go beyond search for overviews, relationships and discovery
  – Text-based data mining and entity extraction

