Why is Computational Linguistics Not More Used in Search Engine Technology?
John Tait
University of Sunderland, UK
Contents of Talk
• Introduction
  – Search Engines
  – Computational Linguistics
• Three Questions
• Research Agenda
• Conclusions
Introduction
Origins
• TREC has been running since 1992
  – Only two systems using CL techniques (Strzalkowski et al. in the 1990s and, more recently, Stokoe, Oakes and Tait) have ever shown an improvement on the standard search engine task
• Performance on most tasks is improved by using more information
  – Surely dictionaries, grammars, semantics should help?
• ACL/COLING CLIIR Workshop in Sydney
Search Engine Process
[Figure: diagram of the search engine process. Labels: Web, Crawler, Signature Data, Index, Query Engine, Searcher]
Index
• Compressed and abstracted form of data (web pages) allowing rapid access to some part of that data
  – Simple version
    • Maps key words (signature) to URLs (data)
  – Real systems
    • Compressed vector of weighted terms (signature) to URLs plus snippet generation support (+ ads etc.)
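The "simple version" of the index can be sketched as an inverted index from key words to URLs. This is a minimal illustration only: the page texts, URLs, and the function names `build_index` and `lookup` are invented here, and a real index would add compression, term weights and snippet support.

```python
from collections import defaultdict


def build_index(pages):
    """Map each key word (signature) to the set of URLs (data) containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every key word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages, invented for illustration.
PAGES = {
    "http://example.org/sails": "main head design for racing yachts",
    "http://example.org/anatomy": "the human head and neck",
}
INDEX = build_index(PAGES)
print(lookup(INDEX, "head design"))
```

Note that a query for "head" alone returns both pages: a plain key word signature cannot tell the sailing sense from the body part sense.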
Crawler
• Continuously in the background
  – Moves over web pages accumulating
    • Signature data
      – term data
      – metadata
      – links
        • URLs, URIs
        • etc.
  – Updates the index
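The crawler loop above can be sketched as a breadth-first walk over links. To stay self-contained, this sketch assumes a tiny in-memory "web" (the `WEB` dictionary and its URLs are invented) instead of real HTTP fetching, and it makes a single pass, where a real crawler runs continuously.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://example.org/a": ("yacht sail design", ["http://example.org/b"]),
    "http://example.org/b": ("sail repair", ["http://example.org/a", "http://example.org/c"]),
    "http://example.org/c": ("harbour guide", []),
}


def crawl(seed, web):
    """Walk outward from the seed URL, accumulating term data and links per page."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # dead link: nothing to record
        text, links = web[url]
        index[url] = {
            "terms": re.findall(r"[a-z]+", text.lower()),
            "links": list(links),
        }
        for link in links:
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    return index


crawled = crawl("http://example.org/a", WEB)
print(sorted(crawled))
```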
Query Engine
• User types in a query
  – Usually short list of key words
• Organizes results
  – “Best” pages first
  – Summary to judge page relevance
  – Clickable links
  – Relevant ads
  – Etc.
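One possible way to put the "best" pages first is to score each page by the cosine similarity between its term vector and the query's. This is a sketch under assumptions: the slides do not say which ranking function is meant, raw term counts stand in for real term weights, and the pages and the `search` function are invented for illustration.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, pages, snippet_len=40):
    """Return (score, url, snippet) tuples for matching pages, best first."""
    q = vectorise(query)
    results = []
    for url, text in pages.items():
        score = cosine(q, vectorise(text))
        if score > 0:
            results.append((score, url, text[:snippet_len]))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


# Hypothetical pages, invented for illustration.
PAGES = {
    "http://example.org/sails": "main head design for racing yachts",
    "http://example.org/anatomy": "the human head and neck",
    "http://example.org/harbour": "harbour guide",
}
for score, url, snippet in search("head design", PAGES):
    print(f"{score:.3f}  {url}  {snippet}")
```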
IR
It’s all about matching documents and queries
Myths about Statistical IR
I.e. Search Engines
Myths
1. It doesn’t work
  • Google?
2. It’s a dead subject
  • 40% improvement since TREC began in 1992
  • Recent progress with e.g. Language Modeling and Continuous Relevance Models
3. IR people don’t know/care about language
  • Karen Sparck Jones
  • much early work
  • Strzalkowski et al. …
Computational Linguistics
What is it ?
Characteristics of CL
• Assumes existence of
  – Dictionary
  – Grammar
  – Semantics
• Independent of task
• Dependent on word meanings
• Arrived at through composition of words in sentences
Statistical CL
• Often
  – Aims to make immediate progress with practical tasks
  – Minimizes assumptions about language
• But still shares the common assumptions of CL
IR/Search Engines
Only care about the task
Characteristics of CL
• Assumes existence of
  – Dictionary
  – Grammar
  – Semantics
• Independent of task
• Dependent on word meanings
• Arrived at through composition of words in sentences
Three Questions about
Why Search Engines don’t use Computational Linguistics
Disclaimer
• Question Answering
  – QA systems use CL
  – Do search engines use QA now?
    • askJeeves???
  – Will they in the future?
    • Will we ever get general/casual users to type long questions?
    • Have known for a long time that long queries are good, but they are rarely used
Three Questions
• Are Computational Linguistic Techniques too inaccurate to improve Search Engines?
• Is the Search Engine Task formed in some way which makes CL techniques ineffective?
• Does statistical information retrieval in fact capture the relevant properties of language but in a form which is inaccessible or hidden?
CL too inaccurate ?
• Long version
  – Is the problem that computational linguistic techniques are too unreliable or narrowly applicable, so improved performance on some documents or queries is masked by worse performance on others?
Example of Problem
• Query “wants” an unusual word sense
  – “main head design”
    • Topic “yachting”: “head” means “head of sail”
• Irrelevant retrieved document 1 had a signature generated off an inaccurate word sense (“body part”)
  – CL eliminates
• Irrelevant not retrieved document 2
  – Word Sense Disambiguation inaccurate
    • Added to relevant set
CL too inaccurate ?
• But best systems do >97% on test data, like Penn Treebank?
  – Overfitted on very sparse data?
  – Don’t do anything like as well on unseen data
  – Especially bad at unseen noun phrases, which are very common search terms
What CL should do
• Stop working on pitifully small samples:
  – IR researchers consider 18 gigabytes too small for real statistical significance
• Ensure you include overfitting protection in your methodology
  – Always test against genuinely unseen data
• Don’t simplify the data
  – But do use “hacks” to make it tractable
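The point about testing on genuinely unseen data can be illustrated with a toy tagger that simply memorises the most frequent tag per word. All data here is invented and far too small to mean anything; the sketch only shows why a score measured on the training data overstates performance on held-out text.

```python
from collections import Counter, defaultdict


def train(tagged):
    """Memorise the most frequent tag for each word seen in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def accuracy(model, tagged, default="NOUN"):
    """Fraction of tokens tagged correctly; unseen words get the default tag."""
    correct = sum(model.get(word, default) == tag for word, tag in tagged)
    return correct / len(tagged)


# Hypothetical toy data, invented for illustration.
TRAIN = [("the", "DET"), ("head", "NOUN"), ("design", "NOUN"),
         ("sails", "VERB"), ("fast", "ADV")]
HELD_OUT = [("the", "DET"), ("sails", "NOUN"), ("design", "VERB"),
            ("spinnaker", "NOUN")]

MODEL = train(TRAIN)
print("accuracy on training data:", accuracy(MODEL, TRAIN))
print("accuracy on held-out data:", accuracy(MODEL, HELD_OUT))
```

The memorising model is perfect on the data it was built from and gets only half of the held-out tokens right, which is the gap the methodology needs to expose.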
Search task not match CL
• Is the conventional information retrieval task formulated in a way which prevents or obstructs computational linguistics contributing?
  – Short queries
    • Not sentences, running text
  – Short ranked lists of highly relevant documents
  – Predetermined document signatures
Search Task not match CL ?
• CL allows the extraction of structural signatures
1. Bracketing is combinatoric
  – Effect on index size
2. Most queries too short to get structure
  – Remember it’s matching queries and documents (signatures)
3. Many queries too short to disambiguate
  – Really??? Co-occurrence
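One way to make the "bracketing is combinatoric" point concrete is to count the distinct binary bracketings of an n-word phrase, which is the Catalan number C(n-1). Reading the slide this way is an assumption, and the function name `binary_bracketings` is invented for the sketch.

```python
from math import comb


def binary_bracketings(n_words):
    """Number of distinct binary bracketings of n_words words (Catalan number C_{n-1})."""
    if n_words < 1:
        raise ValueError("need at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)


for n in (2, 3, 4, 5, 10):
    print(n, "words:", binary_bracketings(n), "bracketings")
```

Three words already allow two bracketings, five allow 14, and ten allow 4862, so indexing every candidate structure would inflate the index quickly.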
What CL should do
• Focus on Word Sense Disambiguation
  – Accept the dictionary is more important than grammar
  – Accept proper names/named entities are at least as important as common words
• Focus on chunking/triple/phrase extraction
  – Full parsing will only ever help as an intermediate step
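As an illustration of dictionary-driven word sense disambiguation, here is a simplified Lesk-style sketch: pick the sense whose gloss shares the most words with the context. The sense labels and glosses in `SENSES` are invented for the example, and this is not the method of any system cited in the talk.

```python
# Hypothetical sense inventory for "head", invented for illustration.
SENSES = {
    "head/body-part": "upper part of the body containing the brain eyes and mouth",
    "head/sail": "top corner of a sail on a yacht attached to the halyard",
}


def disambiguate(context, senses):
    """Return the sense whose gloss overlaps most with the context words.

    Ties (including zero overlap) fall back to the first sense listed.
    """
    context_words = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best


print(disambiguate("main head design", SENSES))             # no overlap: falls back to the first sense
print(disambiguate("main head design yacht sail", SENSES))  # co-occurring terms select the sail sense
```

With only the three query words there is nothing to overlap with either gloss, so the sketch falls back to the first listed sense; two co-occurring topic words are enough to select the sailing sense.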
IR Captures Relevant Properties
• Long version
  – Does statistical information retrieval in fact capture the relevant properties of language but in a form which is inaccessible or hidden?
  – Just like many machine learning techniques
IR captures relevant properties?
• Could be?
  – Success of corpus linguistics
  – Success of data driven and Machine Learning approaches
    • E.g. Statistical MT
    • E.g. Textual Entailment
What CL should do
• Treat what (and whether) IR term weighting algorithms like BM25 capture about language as a legitimate research topic
  – Observation: BM25 looks very like some Machine Learning generated formulae
    • Hardly surprising as BM25 derived by optimisation over a very large corpus
    • Like Porter Stemmer before it
• Consider whether and to what extent division into dictionary, syntax, semantics is “real”
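For readers who have not seen it, a commonly cited form of BM25 can be sketched as below. This assumes the usual defaults `k1=1.2` and `b=0.75` and the non-negative "+1" variant of the IDF term; the documents are invented, and tuned production versions differ in detail.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a common form of BM25."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avgdl = sum(len(d) for d in tokenised) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenised for term in set(d))
    scores = []
    for d in tokenised:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical documents, invented for illustration.
DOCS = [
    "main head design for racing yachts",
    "the human head and neck",
    "harbour guide",
]
print(bm25_scores("head design", DOCS))
```

The formula uses only term counts, document counts and document lengths, which is what makes the question of what it captures about language an interesting one.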
Some more questions
• Are assumptions made in computational linguistics about the nature of lexical semantics and the structural properties of well formed running text in some way ill founded, at least for the information retrieval task?
• Is there some specific property of language (for example semantic redundancy or one topic per document) which means that the relatively crude statistical techniques capture enough information to obtain the available improvements in performance?
Lessons
• CL has much to learn from IR
  – Having a task changes the game
    • Allows the development of effective experimental methodology
    • Effective solutions to task problems become the focus
  – Which might in turn stimulate non-task based research
Lessons 2
• CL for IR
  – Needs to work on better document signatures
    • Small, compressible, characteristics of documents
      – Word sense identifiers
      – Triples
        • Noun verb/prep Noun
        • Chunks
  – Accept probability
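The "Noun verb/prep Noun" triples could be pulled from part-of-speech tagged text with a simple pattern match. This sketch assumes the input is already tagged (the tags below were assigned by hand for the example), uses an invented tag set, and skips determiners and adjectives before the second noun.

```python
def extract_triples(tagged):
    """Return (noun, verb_or_prep, noun) triples from a list of (word, tag) pairs."""
    triples = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "NOUN":
            continue
        j = i + 1
        if j >= len(tagged) or tagged[j][1] not in ("VERB", "PREP"):
            continue
        link = tagged[j][0]
        k = j + 1
        # Skip determiners and adjectives in front of the second noun.
        while k < len(tagged) and tagged[k][1] in ("DET", "ADJ"):
            k += 1
        if k < len(tagged) and tagged[k][1] == "NOUN":
            triples.append((word, link, tagged[k][0]))
    return triples


# Hypothetical hand-tagged sentence, invented for illustration.
TAGGED = [
    ("design", "NOUN"), ("of", "PREP"), ("the", "DET"), ("main", "ADJ"),
    ("head", "NOUN"), ("improves", "VERB"), ("speed", "NOUN"),
]
print(extract_triples(TAGGED))
```

Each triple is small and easy to compress, which is what would make it a candidate for a document signature.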
Lesson 3
• Show document structure is useful for determining relevance
  – Are sentences useful?
    • So can parse trees be useful?
  – Human centred evaluation
  – Paragraphs??
  – Whole documents???
Conclusions
• IR can benefit from Computational Linguistics techniques
  – But CL research needs to focus on the relevant problems
• CL can benefit greatly from trying to get acceptance in IR
  – Focussed task
  – Think of statistical MT