Alternatives to Federated Search

Presented by: Marc Krellenstein
Date: July 29, 2005
| 2
Why did we ever build federated search?
- No one search service or database had all relevant info, or ever could have
- It was too hard to know what databases to search
- Even if you knew which db’s to search, it was too inconvenient to search them all
- Learning one simple interface was easier than learning many complex ones
| 3
Do we still need federated search? No
| 4
No one service or db has all relevant info?
- Databases have grown bigger than ever imagined
  - Google: 8B documents; Google Scholar: 400M+?; Scirus: 200M; Web of Knowledge (Humanities, Social Sci, Science): 28M; Scopus: 27M; Pubmed: 14M
- Why?
  - Cheaper and larger hard disks
  - Faster hardware, better software
  - World-wide network availability…no need to duplicate
| 5
No one service or db has all relevant info?
- No maximum size in sight
  - A good thing, because content continues to grow
- The simplest technical model for search
  - Databases are logically single and central…but physically multiple and internally distributed
  - Google has ~160,000 servers
- The simplest user model for search
- The catch (but even worse for federated search):
  - Get the data
  - Keep search quality high
| 6
It’s hard to know what services to search?
- Google/Google Scholar plus 1–2 vertical search tools
  - Pubmed, Compendex, WoK, PsycINFO, Scopus, etc.
- For casual searches: Google alone is usually enough
- Specialized smaller db’s where needed
  - Known to the researcher or librarian, or available from a list
- Ask a life science researcher what they use: “All I need is Google and Pubmed”
| 7
It’s hard to know what services to search?
- Alerts, RSS, etc. eliminate some searches altogether
- Still…more than one search/source…but must balance inconvenience against the costs of federated search:
  - Will still need to do multiple searches…federated not enough
  - Least-common-denominator search: few advanced features
    - Users are increasingly sophisticated
  - Duplicates
  - Slower response time
  - Broken connectors
  - The feeling that you’re missing stuff…
| 8
One interface is easier to learn than many?
- Yes…studies suggest users like a common interface (if not a common search service)
- BUT:
  - Google has demonstrated the benefits of simplicity
  - More products are adopting simple, similar interfaces
  - There is still too much proprietary syntax, though advanced features and innovation justify some of it
| 9
So what are today’s search challenges?
- Getting the data for centralized and large vertical search services
- Keeping search quality high for these large databases
- Answering hard search questions
| 10
Getting the data for centralized services
- Crawl it if it’s free
- …or make or buy it
  - Expensive, but usually worth the cost
  - Should still be cheaper for customers than many services
- …or index multiple, maybe geographically separate databases with a single search engine that supports distributed search
| 11
Distributed (local/remote) search
- Use a common metadata scheme (e.g., Dublin Core)
- The search engine provides parallel search, integrated ranking/results
  - Google, FAST and Lucene already work this way, even for a ‘single’ database
  - The separate databases can be maintained/updated separately
- Results are truly integrated…as if it’s one search engine
  - One query syntax, advanced capabilities, no duplicates, fast
- Still requires a common technology platform
- Federated search standards may someday approximate this
  - Standard syntax, results metadata…ranking? Amazon’s A9?
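The integrated-ranking step can be sketched as merging per-shard ranked lists into one global list. A minimal sketch follows; the shard names and scores are made-up, and it assumes scores are comparable across shards (e.g., computed with shared global statistics):

```python
import heapq

def merge_shard_results(shard_results, k=10):
    """Merge ranked lists from independently searched shards into one
    globally ranked top-k list. Each shard supplies (score, doc_id)
    pairs already sorted by descending score."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(merged)[:k]

# Hypothetical results from two shards of a distributed index
shards = [
    [(0.92, "pubmed:17"), (0.40, "pubmed:3")],
    [(0.88, "scopus:9"), (0.61, "scopus:2")],
]
print(merge_shard_results(shards, k=3))
# pubmed:17 first, then scopus:9, then scopus:2
```

Because the merge sees real scores from every shard, the result is one integrated ranking rather than the interleaved or per-source lists typical of federated search.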
| 12
Keeping search quality high in big db’s
- Can interpret keyword, Boolean and pseudo-natural-language queries
- Spell checking, thesauri and stemming to improve recall (and sometimes precision)
- Get lots of hits in a big db, but that’s usually OK if there are good ones on top
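The spell-checking idea can be sketched with string similarity against a term dictionary. The vocabulary below is a made-up stand-in; real engines run edit-distance or n-gram lookups over the millions of terms in the index:

```python
import difflib

# Toy vocabulary standing in for an index's term dictionary
vocabulary = ["depression", "precision", "recall", "federated", "search"]

def suggest(term, vocabulary, cutoff=0.8):
    """Suggest spelling corrections for a query term by string
    similarity, a minimal stand-in for 'did you mean' features.
    A fairly high cutoff keeps only close matches."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff)

print(suggest("depresion", vocabulary))  # ['depression']
```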
| 13
Keeping search quality high in big db’s
- Current best-practice relevancy ranking is pretty good:
  - Term frequency (TF): more hits count more
  - Inverse document frequency (IDF): hits of rarer search terms count more
  - Hits of search terms near each other count more
  - Hits on metadata count more
    - Use anchor text (referring text) as metadata
  - Items with more links/references to them count more
    - Authoritative links/referrers count yet more
  - Many other factors: length, date, etc.
- Sophisticated ranking is a weak point for federated search
- Google’s genius: emphasize popularity to eliminate junk from the first pages (even if you don’t always serve the best)
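The TF and IDF factors above can be sketched in a few lines. This is a minimal scorer over toy token lists, not any particular engine's formula; real rankers add proximity, metadata boosts and link signals on top:

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc, corpus):
    """Score one document as the sum over query terms of TF * IDF."""
    doc_counts = Counter(doc)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_counts[term]                      # more hits count more
        df = sum(1 for d in corpus if term in d)   # document frequency
        if tf and df:
            idf = math.log(n_docs / df)            # rarer terms count more
            score += tf * idf
    return score

# Made-up three-document corpus
corpus = [
    ["federated", "search", "overview"],
    ["clinical", "depression", "treatment"],
    ["depression", "symptoms", "and", "treatment", "of", "depression"],
]
query = ["depression", "treatment"]
scores = [tf_idf_score(query, d, corpus) for d in corpus]
# The last document scores highest: two "depression" hits plus "treatment"
```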
| 14
But search challenges remain
- Finding the best (not just good) documents
  - Popularity may not turn up the best, most recent, etc.
- Answering hard questions
  - Hard to match multiple criteria
    - Find an experimental method like this one
  - Hard to get answers to complex questions
    - What precursors were common to World War I and World War II?
- Summarize, uncover relationships, analyze
- Long-term: understand any question…
- None of the above is helped by least-common-denominator federated search
| 15
Finding the best
- Don’t rely too much on popularity
- Even then, relevancy ranking has its limits
  - “I need information on depression”
  - “OK…here are 2,352 articles and 87 books”
  - Need a dialog: “What kind of depression?”…“Psychological”…“What about it?”
- Underlying problem: most searches are under-specified
| 16
One solution: clustering documents
- Group results around common themes: same author, web site, journal, subject…
- Blurt out the largest/most interesting categories: the “inarticulate librarian” model
  - Depression → psychology, economics, meteorology, antiques…
  - Psychology → treatment of depression, depression symptoms, seasonal affective…
  - Psychology → Kocsis, J. (10), Berg, R. (8), …
- Themes could come from static metadata or dynamically by analysis of results text
  - Static: fixed, clear categories and assignments
  - Dynamic: doesn’t require metadata/taxonomy
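The static (metadata-driven) variant can be sketched as grouping a result list by a field and surfacing the largest clusters first. The documents and subject labels below are made-up illustrations:

```python
from collections import defaultdict

def cluster_results(results, field):
    """Group search results by a metadata field and return clusters
    largest-first, so the biggest themes get 'blurted out' first."""
    clusters = defaultdict(list)
    for doc in results:
        clusters[doc[field]].append(doc["title"])
    return sorted(clusters.items(), key=lambda kv: -len(kv[1]))

# Hypothetical results for the query "depression"
results = [
    {"title": "Treating seasonal depression", "subject": "psychology"},
    {"title": "Depression-era economics", "subject": "economics"},
    {"title": "SSRIs and major depression", "subject": "psychology"},
    {"title": "Storm-track depressions", "subject": "meteorology"},
]
for subject, titles in cluster_results(results, "subject"):
    print(subject, len(titles))  # psychology first, with 2 documents
```

Dynamic clustering would derive the themes from the result text itself instead of a stored field, at the cost of noisier labels.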
| 17
Clustering benefits
- Disambiguates and refines search results to get to documents of interest quickly
- Can navigate long result lists hierarchically
  - Would never offer thousands of choices as input…
  - Access to the bottom of the list…maybe just less common
- Won’t work with federated search that retrieves limited results from each source
- Discovery: new aspects or sources
- Can narrow results *after* the search
  - Start with the broadest area search; don’t narrow by subject or other categories first
  - Easier, and you can’t guess wrong, miss useful categories, or pick unneeded ones…results-driven
  - Knee surgery → cartilage replacement, plastics, …
| 21
Answering hard questions
- The main problem is still short searches/under-specification
- One solution: relevance feedback, i.e., marking good and bad results
  - A long-standing and proven search-refinement technique
  - More information is better than less
  - Pseudo-relevance feedback is a research standard
  - Most commercial forms are not widely used…but Pubmed is an exception
  - A catch: must first find a good document to be similar to…may be hard or impossible
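The classic formulation of relevance feedback is the Rocchio algorithm: nudge the query vector toward marked-relevant documents and away from marked-irrelevant ones. A minimal sketch over toy term-weight dictionaries (the terms and weights are made up):

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio feedback: new weight = alpha*query + beta*mean(relevant)
    - gamma*mean(irrelevant); terms that go non-positive are dropped."""
    terms = set(query)
    for doc in relevant + irrelevant:
        terms |= set(doc)
    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        neg = sum(d.get(t, 0.0) for d in irrelevant) / max(len(irrelevant), 1)
        w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if w > 0:
            new_query[t] = round(w, 3)
    return new_query

q = {"depression": 1.0}
good = [{"depression": 1.0, "serotonin": 0.8}]   # marked relevant
bad = [{"depression": 0.5, "economics": 0.9}]    # marked irrelevant
print(rocchio(q, good, bad))
# "serotonin" is pulled into the query; "economics" is pushed out
```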
| 22
One solution: descriptive search
- Let the user or situation provide the ideal “document”, a full problem description, as input in the first place
- Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description
- Might draw on user or query context
- Use thesauri, domain knowledge and limited natural language processing to identify must-haves
- Use lots of data and statistics to find the best matches
  - Again, a problem for federated search with limited data access
- Should provide the best possible search short of real language understanding
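At its core, descriptive search treats the whole description as the query and ranks documents by similarity to it. A minimal sketch using cosine similarity over raw term counts (the documents are made up; a real system would add IDF weighting, thesauri and NLP):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_description(description, docs):
    """Rank documents by similarity to a full free-text description."""
    q = Counter(description.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), d) for d in docs]
    return sorted(scored, reverse=True)

docs = [
    "Methods for treating clinical depression with SSRIs",
    "Economic depression and recovery policy",
]
best_score, best_doc = rank_by_description(
    "I need methods for treating clinical depression", docs)[0]
# The clinical article ranks first: many shared description terms
```

The long description disambiguates “depression” on its own, which a two-word query could not.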
| 23
Summarize, discover & analyze
- How do you summarize a corpus?
  - May want to report on what’s present, numbers of occurrences, trends
  - Ex: What diseases are studied the most? Must know all diseases and look one by one
- How do you find a relationship if you don’t know what relationships exist?
  - Ex: Does gene p53 relate to any disease? Must check for each possible relationship
- Ad hoc analysis
  - How do all genes relate to this one disease? Over time? In what organisms has the gene been studied?
  - Show me the document evidence
| 24
One solution: text mining
- Identify entities (things) in a text corpus
  - Examples: authors, universities…diseases, drugs, side-effects, genes…companies, lawsuits, plaintiffs, defendants…
  - Use lexicons, patterns and NLP to find any or all instances of an entity (including new ones)
- Identify relationships:
  - Through co-occurrence
    - Relationship presumed from proximity
    - Example: author-university affiliation
  - Through limited natural language processing
    - Semantic relations: causes, is-part-of, etc.
    - Examples: drug-causes-disease, drug-treats-disease
    - Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora (…it causes…)
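The co-occurrence approach can be sketched as counting sentences in which entities from two lexicons appear together, presuming a relationship from proximity. The lexicons and sentences below are made-up illustrations:

```python
from collections import Counter
from itertools import product

# Toy lexicons standing in for curated entity dictionaries
genes = {"p53", "brca1"}
diseases = {"cancer", "arthritis"}

def cooccurrences(sentences):
    """Count gene-disease pairs that co-occur within a sentence."""
    pairs = Counter()
    for s in sentences:
        tokens = set(s.lower().replace(".", "").split())
        for g, d in product(genes & tokens, diseases & tokens):
            pairs[(g, d)] += 1
    return pairs

sentences = [
    "Mutations in p53 are implicated in many forms of cancer.",
    "The role of p53 in cancer progression is well studied.",
    "BRCA1 screening informs hereditary cancer risk.",
]
print(cooccurrences(sentences))
# (p53, cancer) seen twice; (brca1, cancer) once
```

Moving from co-occurrence to semantic relations (causes, treats) requires the verb-level NLP noted above, since proximity alone cannot distinguish them.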
| 25
Gene-disease relationships?
| 26
Relationships to p53
| 27
Author teams in HIV research?
| 28
Indirect links from leukemia to Alzheimer’s via enzymes
| 29
Long-term: answer any question
- Must recognize multiple (any) entities and relationships
- Must recognize all forms of linguistic relationship
- Must have a background of common-sense information (or enough entities/relations?)
  - e.g., information on donors (to political parties)
- For now, building text miners domain by domain is perhaps the best we can do
  - Can build on preceding pieces…e.g., if you know drugs, diseases and drug-disease causation, you can try to recognize ‘advancements in drug therapy’
| 30
Summary
- Federated search addressed the problems of a different time
  - A highly fragmented search space, limitations of individual db’s, technical and interface problems, and the need to just get basic answers
- Today’s search environment is increasingly centralized and robust
- The range of content and the demands of users continue to increase
- Adequate search is a given…really good search is a challenge best served by new technologies that don’t fit into a least-common-denominator framework
  - Need to locate the best documents (sophisticated ranking, clustering)
  - Need to answer complex questions
  - Need to go beyond search for overviews and relationship discovery