Next generation search

Marc Krellenstein, VP, Search and Discovery, Elsevier. August 23, 2004. [email protected]
Page 1: Next generation search

Next generation search

Marc Krellenstein
VP, Search and Discovery
Elsevier
August 23, 2004
[email protected]

Page 2: Next generation search

Basic search is pretty good

• Modern search engines are fast and scalable
  – Having the data (usually lots) is still key
• Can interpret keyword, Boolean and pseudo-natural language queries
  – Ex: “how to make an international call”
• Spell checking, thesauri and stemming to improve recall (and sometimes precision)
  – Recall = % of relevant documents found
  – Precision = % of returned documents that are relevant
• Get lots of hits, but that’s usually OK if there are good ones on top
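
The recall and precision definitions on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the document ids are assumptions, not part of the talk.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    precision = share of returned documents that are relevant
    recall    = share of relevant documents that were returned
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
returned = ["d1", "d2", "d3", "d4"]   # what the engine gave back
relevant = ["d2", "d4", "d7"]         # what a judge marked as relevant
precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).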

Page 3: Next generation search

Basic search is pretty good

• Best practice relevancy ranking is good:
  – Term frequency (TF): more hits count more
  – Inverse document frequency (IDF): hits of rarer search terms count more
    • Ex: diabetes diagnosis and treatment
  – Hits of search terms near each other count more
    • Ex: penicillin allergy vs. “penicillin allergy”
  – Hits on metadata (title, subject, etc.) count more
    • Use anchor text – referring text – as metadata
  – Items with more links/references to them count more
    • Authoritative links/referrers count yet more
  – Many other factors: length, date, etc.
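
A minimal sketch of the TF and IDF factors alone, assuming whitespace tokenization and a plain log(N/df) weight. Real engines combine these with proximity, metadata and link signals; the toy documents and the function name are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document by the sum of TF * IDF over the query terms."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log(n_docs / df[term])  # rarer term, larger weight
                score += tf[term] * idf
        scores[doc_id] = score
    return scores


# Toy corpus, invented for this example.
docs = {
    "a": "diabetes treatment and diet",
    "b": "diagnosis and treatment of asthma",
    "c": "diabetes diagnosis and treatment",
}
scores = tf_idf_scores(["diabetes", "diagnosis", "and", "treatment"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])
```

In this toy corpus "and" and "treatment" appear in every document, so their IDF is zero and they contribute nothing; the rarer terms "diabetes" and "diagnosis" decide the ranking, which is the point of the slide's example query.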

Page 4: Next generation search

Basic search is pretty good

• Using these techniques search engines can locate specific documents, or good documents (if not the absolute best) around general or specific topics

• But challenges remain…

Page 5: Next generation search

Current challenges

• Integrated search: Content still in silos
  – Silos getting bigger but there are still dozens
• Finding the best (not just good) documents
• Answering hard questions
  – Hard to match multiple criteria
    • find an experimental method like this one
  – Hard to get answers to complex questions
    • patient X with pre-existing conditions Y presents with Z…what information is relevant?
• Summary, discovery and analysis
  – Summarize, uncover relationships, analyze
• Long-term: understand any question…

Page 6: Next generation search

The integration challenge

• Two approaches:
  – Build bigger databases
    • Sometimes the easiest way…
    • …but can be difficult or impossible to secure appropriate rights and consolidate data
  – Distributed search: Search separately managed (or owned) large databases as if they are one
    • Technically more challenging, but a scalable and maintainable architecture

Page 7: Next generation search

Distributed search

• Index multiple (maybe geographically) separate databases with a single search engine that supports distributed search
  – Use common metadata scheme (e.g., Dublin Core) and/or determine other common fields or field mappings for each database
  – Search engine provides parallel search, integrated ranking and integrated results
  – The separate databases can be maintained and updated separately
  – Elsevier is currently unifying its own sources in such a model with a ‘web service’ architecture
  – Such services can also be offered externally
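
The "parallel search, integrated ranking and integrated results" idea can be sketched as follows. This is a toy under stated assumptions, not Elsevier's architecture: each database is an in-memory dict, every database uses the same (made-up) scoring function so scores are directly comparable, and all names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def search_shard(shard, query):
    """Toy per-database search: score = number of query-term occurrences."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in shard.items():
        tokens = text.lower().split()
        score = sum(tokens.count(t) for t in terms)
        if score:
            hits.append((score, doc_id))
    return sorted(hits, reverse=True)  # best first


def distributed_search(shards, query, top_k=3):
    # Query every database in parallel.
    with ThreadPoolExecutor() as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, query), shards))
    # Because all shards score the same way, the sorted lists can be
    # merged into one integrated ranking.
    merged = heapq.merge(*per_shard, reverse=True)
    return list(islice(merged, top_k))


# Two separately maintained toy databases.
shard_a = {"a1": "gene p53 and cancer", "a2": "p53 p53 mutation"}
shard_b = {"b1": "p53 in mice", "b2": "protein folding"}
results = distributed_search([shard_a, shard_b], "p53")
print(results)
```

The merge step only works because the scoring is shared; that is the "common technology platform" requirement in practice.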

Page 8: Next generation search

Distributed search

• Simplifies some business issues, but still requires common technology platform
• Where common platform not possible, can use federated search (i.e., metasearch)
  – Translate queries
  – Access and perform parallel search of multiple search engines (vs. multiple databases)
  – Integrate results as well as possible
  – Use standards to approximate distributed search
    • Uniform access, one query language (Z39.50, updated)
    • Add standards for relevancy ranking and results return?
    • NISO and its members are working on standards
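
When the underlying engines differ, their relevance scores are generally not comparable, so "integrate results as well as possible" often comes down to merging by rank position. The sketch below uses simple round-robin interleaving with de-duplication; this is one common heuristic chosen for illustration, not a method named in the talk, and the result lists are invented.

```python
from itertools import zip_longest


def interleave_results(result_lists, top_k=10):
    """Round-robin merge of ranked lists whose scores are not comparable."""
    merged, seen = [], set()
    # Take the rank-1 hit from each engine, then the rank-2 hits, and so on.
    for rank_row in zip_longest(*result_lists):
        for doc_id in rank_row:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]


# Hypothetical ranked result lists from three different search engines.
engine_a = ["doc1", "doc2", "doc3"]
engine_b = ["doc2", "doc4"]
engine_c = ["doc5"]
merged = interleave_results([engine_a, engine_b, engine_c])
print(merged)
```

A shared standard for relevancy ranking, as the slide asks for, would let a metasearch layer do better than this positional guess.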

Page 9: Next generation search

Finding the best

• More data can also make finding the best documents harder
  – For searches on rare items, more data is a win
  – For all other searches, it’s more likely your answer is in there…but can be a problem too
  – Why? Relevancy is good but…
• Relevancy has its limits
  – “I need information on depression”
  – “Ok…here are 2,352 articles and 87 books”
• Need a dialog…“what kind of depression?”…“psychological”…“what about it?”
• Underlying problem: most searches are under-specified

Page 10: Next generation search

One solution: clustering documents

• Group results around common themes: same author, web site, journal, subject…
• Blurt out largest/most interesting categories: the inarticulate librarian model
• Depression → psychology, economics, meteorology, antiques…
  – Psychology → treatment of depression, depression symptoms, seasonal affective…
  – Psychology → Kocsis, J. (10), Berg, R. (8), …
• Themes could come from static metadata or dynamically by analysis of results text
  – Static: fixed, clear categories and assignments
  – Dynamic: doesn’t require metadata/taxonomy
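
A minimal sketch of the static-metadata variant: group a result list by a metadata field and report the largest groups with counts, in the style of "Kocsis, J. (10)". The result records and field names are invented for illustration; dynamic clustering from result text would need a different technique and is not shown.

```python
from collections import Counter, defaultdict


def cluster_by_field(results, field):
    """Group result ids by the value of one metadata field."""
    groups = defaultdict(list)
    for doc in results:
        groups[doc.get(field, "unknown")].append(doc["id"])
    return dict(groups)


def facet_counts(results, field):
    """Largest categories first, as (value, count) pairs."""
    return Counter(doc.get(field, "unknown") for doc in results).most_common()


# Hypothetical hits for the query "depression".
results = [
    {"id": 1, "subject": "psychology"},
    {"id": 2, "subject": "economics"},
    {"id": 3, "subject": "psychology"},
    {"id": 4, "subject": "meteorology"},
    {"id": 5, "subject": "psychology"},
    {"id": 6, "subject": "economics"},
]
facets = facet_counts(results, "subject")
clusters = cluster_by_field(results, "subject")
print(facets)
```

Applying the same functions again to one cluster, on a different field such as author, gives the second level of the hierarchy.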

Page 11: Next generation search

Clustering benefits

• Disambiguates and refines search results to get to documents of interest quickly
• Can navigate long result lists hierarchically
  – Would never offer thousands of choices to choose from as input…
  – Access to bottom of list…maybe just less common
• Discovery – new aspects or sources
• Can narrow results *after* search
  – Start with the broadest area search – don’t narrow by subject or other categories first
  – Easier, plus can’t guess wrong, miss useful, or pick unneeded, categories…results-driven
    • Knee surgery → cartilage replacement, plastics, …

Page 12: Next generation search
Page 13: Next generation search
Page 14: Next generation search
Page 15: Next generation search
Page 16: Next generation search

Answering hard questions

• Main problem is still short searches/under-specification
• One solution: Relevance feedback – marking good and bad results
• A long-standing and proven search refinement technique
  – More information is better than less
  – Pseudo-relevancy feedback is a research standard
• Most commercial forms not widely used…
• …but PubMed is an exception
• A catch: Must first find a good document to be similar to…may be hard or impossible
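
The talk does not name a specific feedback algorithm. One classic formulation is Rocchio's: move the query vector toward the documents marked good and away from those marked bad. The sketch below assumes documents and queries are term-weight dicts; the weights alpha, beta and gamma are illustrative defaults, and the example data is invented.

```python
from collections import Counter


def rocchio(query_vec, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a refined query: original + good centroid - bad centroid."""
    new_query = Counter()
    for term, weight in query_vec.items():
        new_query[term] += alpha * weight
    for docs, factor in ((good_docs, beta), (bad_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                # Average over the marked documents (their centroid).
                new_query[term] += factor * weight / len(docs)
    # Terms pushed to zero or below are dropped from the query.
    return {term: w for term, w in new_query.items() if w > 0}


# Hypothetical term-weight vectors.
query = {"depression": 1.0}
good = [{"depression": 1, "therapy": 2}, {"depression": 1, "symptoms": 1}]
bad = [{"depression": 1, "economic": 3}]
refined = rocchio(query, good, bad)
print(refined)
```

The refined query gains "therapy" and "symptoms" from the good documents and never picks up "economic", which is how a one-word search becomes better specified. Pseudo-relevance feedback runs the same update while simply assuming the top-ranked results are good.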

Page 17: Next generation search

One solution: descriptive search

• Let the user or situation provide the ideal “document” – a full problem description – as input in the first place
  – Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description
  – Might draw on user or query context – user characteristics (MD or nurse), patient record,…
  – Use thesauri, domain knowledge and limited natural language processing to identify must-haves
    • Main focus, pre-existing conditions, etc.
  – Should provide the best possible search short of real language understanding
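
A rough sketch of how a full problem description might be turned into a structured query, assuming a tiny hand-made thesaurus for the must-have concepts and plain term frequency for the rest. The lexicon, stopword list, function name and example text are all hypothetical; a real system would use domain thesauri and some natural language processing as the slide describes.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "with", "on", "of", "in", "to", "is"}

# Hypothetical thesaurus: surface phrase -> must-have concept.
MUST_HAVE_LEXICON = {
    "diabetes": "diabetes",
    "diabetic": "diabetes",
    "hypertension": "hypertension",
    "high blood pressure": "hypertension",
}


def build_query(description, top_k=3):
    """Split a free-text description into required and optional terms."""
    text = description.lower()
    # Concepts found through the thesaurus become required criteria.
    must = sorted({concept for phrase, concept in MUST_HAVE_LEXICON.items()
                   if phrase in text})
    # The most frequent remaining words become optional ranking terms.
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]
    should = [t for t, _ in Counter(tokens).most_common() if t not in must]
    return {"must": must, "should": should[:top_k]}


description = ("Diabetic patient with high blood pressure presents with "
               "chest pain. Chest pain worsens on exertion.")
query = build_query(description)
print(query)
```

The thesaurus maps "diabetic" and "high blood pressure" onto canonical concepts, so the pre-existing conditions become must-haves even though the description never uses the words "diabetes" or "hypertension".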

Page 18: Next generation search

Summarize, discover & analyze

• How do you summarize a corpus?
  – May want to report on what’s present, numbers of occurrences, trends, etc.
  – Ex: What diseases are studied the most?
  – Must know all diseases and look one by one
• How do you find a relationship if you don’t know what relationships exist?
  – Ex: does gene p53 relate to any disease?
  – Must check for each possible relationship
• Ad hoc analysis
  – How do all genes relate to this one disease? Over time? What organisms has the gene been studied in? Show me the document evidence

Page 19: Next generation search

One solution: text mining

• Identify entities (things) in a text corpus
  – Examples: authors, universities… diseases, drugs, side-effects, genes…companies, lawsuits, plaintiffs, defendants…
  – Use lexicons, patterns, NLP for finding any or all instances of the entity (including new ones)
• Identify relationships:
  – Through co-occurrence
    • Relationship presumed from proximity
    • Example: author-university affiliation
  – Through limited natural language processing
    • Semantic relations – causes, is-part-of, etc.
    • Examples: drug-causes-disease, drug-treats-disease
    • Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora (…it causes…)
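
A minimal sketch of the first route only: lexicon-based entity spotting followed by sentence-level co-occurrence counting. The lexicon and the two toy texts are made up for illustration and are not data from the pilot. The NLP route (verbs, voice, anaphora) needs a parser and is not shown.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical lexicon: entity type -> known names (lowercase).
LEXICON = {
    "gene": {"p53", "brca1"},
    "disease": {"leukemia", "breast cancer"},
}


def find_entities(sentence):
    """Return {entity_type: set of lexicon names found in the sentence}."""
    text = sentence.lower()
    return {
        etype: {n for n in names
                if re.search(r"\b" + re.escape(n) + r"\b", text)}
        for etype, names in LEXICON.items()
    }


def cooccurrences(corpus, type_a, type_b):
    """Count (a, b) pairs that appear together in the same sentence."""
    pairs = Counter()
    for doc in corpus:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            ents = find_entities(sentence)
            pairs.update(product(ents[type_a], ents[type_b]))
    return pairs


# Toy texts written for this example.
corpus = [
    "Mutations in p53 are common in leukemia. BRCA1 is linked to breast cancer.",
    "Loss of p53 was observed in leukemia patients.",
]
pairs = cooccurrences(corpus, "gene", "disease")
print(pairs.most_common())
```

Co-occurrence only presumes a relationship from proximity; it says nothing about what kind of relationship it is, which is where the verb-based analysis comes in.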

Page 20: Next generation search
Page 21: Next generation search
Page 22: Next generation search
Page 23: Next generation search

Elsevier pilot project

• Goal: Demonstrate real value to a working expert in 90 days
• Chose biomedical domain
• Hired expert to help define entities and relationships
• Used 25,000 abstracts from 23 Elsevier journals
• Worked with text mining vendor to define and revise extraction of entities and relationships

Page 24: Next generation search

Pilot scenarios

• Answered real questions using real data – not a demo or mock-up
• The user:
  – anyone involved in genomic academic research: a primary researcher, graduate student or post-doc
• Scenario 1: Research about gene p53
  – What journals should I publish in?
  – Who’s an expert I can ask for advice?
  – What connections have been made to my gene?
  – What organisms have my gene?

Page 25: Next generation search

What journals should I publish in?

Page 26: Next generation search

Who’s an expert?

Page 27: Next generation search

Connections to p53?

Page 28: Next generation search

To organisms?

Page 29: Next generation search

Pilot scenarios

• Scenario 2: Disease research
  – What diseases are most researched?
  – What’s the time trend in HIV research?
  – What are the centers of HIV research?
  – Who are the author teams in HIV?
  – What gene-disease relationships are there? What were they to start in 1996? through 1997?
  – (Note: Cannot practically answer the above with search alone)
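
Once entities have been extracted per document, questions like these reduce to counting and grouping over the extracted records. The sketch below assumes a hypothetical record layout (one dict per abstract with a year and a list of disease mentions); the records are invented and do not reflect the pilot's data or results.

```python
from collections import Counter

# Hypothetical output of an entity-extraction step, one record per abstract.
records = [
    {"year": 1996, "diseases": ["HIV"]},
    {"year": 1996, "diseases": ["leukemia"]},
    {"year": 1997, "diseases": ["HIV", "leukemia"]},
    {"year": 1997, "diseases": ["HIV"]},
]


def most_researched(records):
    """Diseases ranked by the number of abstracts that mention them."""
    counts = Counter(d for r in records for d in set(r["diseases"]))
    return counts.most_common()


def time_trend(records, disease):
    """Number of abstracts mentioning the disease, per year."""
    trend = Counter(r["year"] for r in records if disease in r["diseases"])
    return sorted(trend.items())


ranking = most_researched(records)
hiv_trend = time_trend(records, "HIV")
print(ranking)
print(hiv_trend)
```

Neither function needs to know the list of diseases in advance, which is why extraction makes these overview questions practical where search alone does not.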

Page 30: Next generation search

What diseases are most researched?

Page 31: Next generation search

Time trend in HIV research?

Page 32: Next generation search

Centers of HIV research?

Page 33: Next generation search

Author teams in HIV research?

Page 34: Next generation search

Gene-disease relationships?

Page 35: Next generation search

To start, in 1996?

Page 36: Next generation search

Through 1997?

Page 37: Next generation search

Pilot scenarios

• Scenario 3: Connections between leukemia and Alzheimer’s
  – Are there direct connections between leukemia and Alzheimer’s?
  – What enzymatic activity is associated with leukemia?
  – Are there indirect connections between leukemia and Alzheimer’s mediated by enzymatic activity?
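
With extracted relationships stored as a graph, a direct connection is an edge and an indirect connection is a shared neighbor. The sketch below uses placeholder entity names ("enzyme_A" and so on) and invented edges; it makes no biomedical claim and is not the pilot's data model.

```python
from collections import defaultdict

# Hypothetical extracted relationships as undirected edges.
relations = [
    ("leukemia", "enzyme_A"),
    ("leukemia", "enzyme_B"),
    ("enzyme_A", "alzheimers"),
    ("enzyme_C", "alzheimers"),
]


def build_graph(relations):
    """Adjacency sets for an undirected relationship graph."""
    graph = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def directly_connected(graph, source, target):
    return target in graph.get(source, set())


def indirect_links(graph, source, target):
    """Entities related to both source and target (two-hop paths)."""
    shared = graph.get(source, set()) & graph.get(target, set())
    return sorted(shared - {source, target})


graph = build_graph(relations)
direct = directly_connected(graph, "leukemia", "alzheimers")
via = indirect_links(graph, "leukemia", "alzheimers")
print(direct, via)
```

In this toy graph there is no direct edge, but one intermediate entity is linked to both ends, which is the shape of answer the third scenario question is after.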

Page 38: Next generation search

Direct connections between leukemia and Alzheimer’s?

Page 39: Next generation search

Enzymes associated with leukemia?

Page 40: Next generation search

Indirect links from leukemia to Alzheimer’s via enzymes

Page 41: Next generation search
Page 42: Next generation search

Red – Product

Pink – Reactant

Green – Reagent

Brown – Solvent

Page 43: Next generation search

The power of text mining

• Almost impossible to determine manually

• Can provide completely unexpected relationships between source and target

• Catch: must do the work domain by domain

• Silver lining: can build on preceding work

Page 44: Next generation search

Long-term: answer any question

• Must recognize multiple (any) entities and relationships
• Must recognize all forms of linguistic relationship
• Must have background of common sense information (or enough entities/relations?)
  – Information on donors (to political parties)
• For now, building text miners, domain by domain, is perhaps the best we can do
  – Can build on preceding pieces…e.g., if you know drugs, diseases and drug-disease causation, can try to recognize ‘advancements in drug therapy’

Page 45: Next generation search

Summary

• Need to search more broadly, more easily
  – Larger databases
  – Distributed search
• Need to locate best documents in even larger (distributed) databases
  – Clustering to find documents of real interest
• Need to answer complex questions
  – Descriptive search
• Need to go beyond search for overviews, relationship discovery and analysis
  – Text-based data mining
• Through text mining (perhaps), approach full natural language understanding

