Why is Computational Linguistics Not More Used in Search Engine Technology?

John Tait

University of Sunderland, UK

Contents of Talk

• Introduction

– Search Engines

– Computational Linguistics

• Three Questions

• Research Agenda

• Conclusions

Introduction

Origins

• TREC has been running since 1992

– Only two systems using CL techniques (Strzalkowski et al. in the 1990s and, more recently, Stokoe, Oakes and Tait) have ever shown an improvement on the standard search engine task

• Performance on most tasks is improved by using more information

– Surely dictionaries, grammars, semantics should help?

• ACL/COLING CLIIR Workshop in Sydney

Search Engine Process

[Diagram: Web, Crawler, Index (signature, data), Query Engine, Searcher]

Index

• Compressed and abstracted form of data (web pages) allowing rapid access to some part of that data

– Simple version

• Maps key words (signature) to URLs (data)

– Real systems

• Compressed vector of weighted terms (signature) to URLs plus snippet generation support (+ ads etc.)
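The "simple version" can be sketched as an inverted index. This is a minimal illustration only, not how any real engine is built: the URLs, the page texts, and the helper names `build_index` and `lookup` are hypothetical, and real systems store compressed, weighted term vectors rather than plain sets.

```python
from collections import defaultdict

def build_index(pages):
    """Map each key word (signature) to the set of URLs (data) that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, query):
    """Return the URLs that contain every key word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical toy pages, invented for illustration.
pages = {
    "http://example.org/a": "yacht sail design",
    "http://example.org/b": "sail repair",
}
index = build_index(pages)
print(sorted(lookup(index, "sail design")))
```

The lookup intersects the URL sets of the query words, so adding a key word can only narrow the result.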

Crawler

• Runs continuously in the background

– Moves over web pages accumulating

• Signature data

– term data

– metadata

– links

• URLs, URIs

• etc.

– Updates the index
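A crawler of this kind can be sketched as a breadth-first walk that follows links and records term and link data for each page. The sketch below makes strong simplifying assumptions: the "web" is a hypothetical in-memory dictionary rather than pages fetched over HTTP, the function name `crawl` is invented, and it makes a single pass instead of running continuously.

```python
from collections import deque

def crawl(web, seed):
    """Breadth-first walk over `web`, a dict mapping URL -> (text, outgoing links)."""
    index = {}                 # URL -> data accumulated for that page
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        if url not in web:     # dangling link: nothing to read
            continue
        text, links = web[url]
        index[url] = {
            "terms": sorted(set(text.lower().split())),
            "links": list(links),
        }
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Hypothetical toy web, invented for illustration; "u4" is a dangling link.
web = {
    "u1": ("Sail design", ["u2", "u3"]),
    "u2": ("Yacht racing", ["u1"]),
    "u3": ("Head of sail", ["u4"]),
}
index = crawl(web, "u1")
print(list(index))
```

The `seen` set stops the crawler from revisiting pages when links form a cycle, as `u1` and `u2` do here.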

Query Engine

• User types in a query

– Usually short list of key words

• Organizes results

– “Best” pages first

– Summary to judge page relevance

– Clickable links

– Relevant ads

– Etc.

IR

It’s all about matching documents and queries
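One common way to make "matching" concrete is to treat the query and each document as term-count vectors and rank documents by cosine similarity. The talk does not name a particular matching function, so this choice, the toy documents, and the function names `cosine` and `rank` are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors held as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return names of documents with a non-zero match, best match first."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), name)
              for name, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ordered if score > 0]

# Hypothetical toy documents, invented for illustration.
docs = {
    "d1": "main sail head design",
    "d2": "head injury treatment",
    "d3": "garden design",
}
print(rank("main head design", docs))
```

Nothing in this matching step knows that "head" means different things in `d1` and `d2`; both match on the surface string.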

Myths about Statistical IR

I.e. Search Engines

Myths

1. It doesn’t work

• Google?

2. It’s a dead subject

• 40% improvement since TREC began in 1992

• Recent progress with e.g. Language Modeling and Continuous Relevance Models

3. IR people don’t know/care about language

• Karen Sparck Jones

• much early work

• Strzalkowski et al. …

Computational Linguistics

What is it ?

Characteristics of CL

• Assumes existence of

– Dictionary

– Grammar

– Semantics

• Independent of task

• Dependent on word meanings

• Arrived at through composition of words in sentences

Statistical CL

• Often

– Aims to make immediate progress with practical tasks

– Minimizes assumptions about language

• But still shares the common assumptions of CL

IR/Search Engines

Only care about the task

Three Questions about Why Search Engines Don’t Use Computational Linguistics

Disclaimer

• Question Answering

– QA systems use CL

– Do search engines use QA now?

• askJeeves ???

– Will they in the future?

• Will we ever get general/casual users to type long questions?

• We have known for a long time that long queries are good, but they are rarely used

Three Questions

• Are Computational Linguistic Techniques too inaccurate to improve Search Engines?

• Is the Search Engine Task formed in some way which makes CL techniques ineffective?

• Does statistical information retrieval in fact capture the relevant properties of language but in a form which is inaccessible or hidden?

CL too inaccurate ?

• Long Version

– Is the problem that computational linguistic techniques are too unreliable or narrowly applicable, so improved performance on some documents or queries is masked by worse performance on others?

Example of Problem

• Query “wants” an unusual word sense

– “main head design”

• Topic “yachting”: “head” means “head of sail”

• Irrelevant retrieved document 1 had a signature generated from an inaccurate word sense (“body part”)

– CL eliminates it

• Irrelevant, not retrieved document 2

– Word Sense Disambiguation inaccurate

• Added to relevant set
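The failure mode in this example can be illustrated with a simplified Lesk-style disambiguator, which picks the sense whose gloss shares the most words with the context. This is a sketch under stated assumptions, not the method used by any system mentioned in the talk: the two-sense inventory for "head", its glosses, and the function name `disambiguate` are all hypothetical.

```python
def disambiguate(word_senses, context):
    """Pick the sense whose gloss shares the most words with the context.

    Ties, including zero overlap, fall back to the first sense listed.
    """
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in word_senses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical sense inventory for "head"; glosses invented for illustration.
HEAD_SENSES = {
    "body_part": "upper part of the body containing the brain and face",
    "sail_top": "top corner of a sail on a yacht mast",
}

# Running text gives enough context to pick the yachting sense.
print(disambiguate(HEAD_SENSES, "yacht sail with a reinforced head"))
# The short query overlaps with neither gloss, so the default sense wins.
print(disambiguate(HEAD_SENSES, "main head design"))
```

With the longer context the sketch selects `sail_top`; with the three-word query it falls back to `body_part`, which is the kind of inaccurate word sense the example describes.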

CL too inaccurate ?

• But best systems do >97% on test data - like Penn Treebank?

– Overfitted on very sparse data?

– Don’t do anything like as well on unseen data

– Especially bad at unseen noun phrases - very common search terms

What CL should do

• Stop working on pitifully small samples:

– IR researchers consider 18 gigabytes too small for real statistical significance

• Ensure you include overfitting protection in your methodology

– Always test against genuinely unseen data

• Don’t simplify the data

– But do use “hacks” to make it tractable

Search Task Does Not Match CL?

• Is the conventional information retrieval task formulated in a way which prevents or obstructs computational linguistics contributing?

– Short queries

• Not sentences, running text

– Short ranked lists of highly relevant documents

– Predetermined document signatures

Search Task Does Not Match CL?

• CL allows the extraction of structural signatures

1. Bracketing is combinatoric

– Effect on index size

2. Most queries too short to get structure

– Remember it’s matching queries and documents (signatures)

3. Many queries too short to disambiguate

– Really ??? Co-occurrence
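To give a feel for why bracketing is combinatoric, the sketch below counts the distinct binary bracketings of an n-word sequence, which is the Catalan number C(n-1). The slide does not say which kind of bracketing is meant, so restricting the count to binary bracketings is an assumption, and `binary_bracketings` is a name invented here.

```python
from math import comb

def binary_bracketings(n_words):
    """Count binary bracketings of n_words words: the Catalan number C(n_words - 1)."""
    if n_words < 1:
        return 0
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)

for n_words in (2, 3, 4, 10):
    print(n_words, binary_bracketings(n_words))
```

A three-word query has only 2 bracketings, while a ten-word sentence has 4862, so an index that stored every candidate structure would grow very quickly with sentence length.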

What CL should do

• Focus on Word Sense Disambiguation

– Accept the dictionary is more important than grammar

– Accept proper names/named entities are at least as important as common words

• Focus on chunking/triple/phrase extraction

– Full parsing will only ever help as an intermediate step

IR Captures Relevant Properties

• Long Version

– Does statistical information retrieval in fact capture the relevant properties of language but in a form which is inaccessible or hidden?

– Just like many machine learning techniques

IR captures relevant properties?

• Could be?

– Success of corpus linguistics

– Success of data driven and Machine Learning approaches

• E.g. Statistical MT

• E.g. Textual Entailment

What CL should do

• Treat as a legitimate research topic whether, and what, IR term weighting algorithms like BM25 are capturing about language

– Observation: BM25 looks very like some Machine Learning generated formulae

• Hardly surprising as BM25 derived by optimisation over a very large corpus

• Like Porter Stemmer before it

• Consider whether and to what extent division into dictionary, syntax, semantics is “real”
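For readers who have not seen BM25 written out, the sketch below scores documents with one common Okapi BM25 variant. Several things here are assumptions rather than anything stated in the talk: the parameter values `k1=1.2` and `b=0.75` are conventional defaults, the IDF form with `+ 1.0` inside the logarithm is one of several in use, and the toy documents and the name `bm25_scores` are invented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a common Okapi BM25 variant.

    Assumes `docs` is non-empty and contains at least one non-empty text.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    doc_freq = Counter()           # number of documents containing each term
    for toks in tokenized.values():
        doc_freq.update(set(toks))
    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores

# Hypothetical toy documents, invented for illustration.
docs = {
    "d1": "sail head design sail",
    "d2": "head injury",
    "d3": "garden plants",
}
scores = bm25_scores("sail head", docs)
print(sorted(scores, key=scores.get, reverse=True))
```

The formula contains no dictionary, grammar, or semantics, only term counts, document counts, and lengths, which is what makes the question of what it captures about language worth asking.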

Some more questions

• Are assumptions made in computational linguistics about the nature of lexical semantics and the structural properties of well formed running text in some way ill founded, at least for the information retrieval task?

• Is there some specific property of language (for example semantic redundancy or one topic per document) which means that the relatively crude statistical techniques capture enough information to obtain the available improvements in performance?

Lessons

• CL has much to learn from IR

– Having a task changes the game

• Allows the development of effective experimental methodology

• Effective solutions to task problems become the focus

– Which might in turn stimulate non-task based research

Lessons 2

• CL for IR

– Needs to work on better document signatures

• Small, compressible characteristics of documents

– Word sense identifiers

– Triples

• Noun verb/prep Noun

• Chunks

– Accept probability
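As a rough idea of what a triple-based signature might look like, the sketch below pulls "Noun verb/prep Noun" triples out of a sequence of tagged tokens. It is a toy under stated assumptions: the tokens are hand-tagged with a hypothetical three-tag scheme (`NOUN`, `VERB`, `PREP`), only strictly adjacent tokens are matched, and `extract_triples` is a name invented here. A real system would need a tagger or chunker in front of it.

```python
def extract_triples(tagged):
    """Collect (noun, verb-or-preposition, noun) triples from adjacent tagged tokens."""
    triples = []
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        if t1 == "NOUN" and t2 in ("VERB", "PREP") and t3 == "NOUN":
            triples.append((w1, w2, w3))
    return triples

# Hand-tagged toy input, invented for illustration.
tagged = [
    ("head", "NOUN"), ("of", "PREP"), ("sail", "NOUN"),
    ("catches", "VERB"), ("wind", "NOUN"),
]
print(extract_triples(tagged))
```

Each triple is small and could be stored as a single index key alongside the plain terms.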

Lesson 3

• Show document structure is useful for determining relevance

– Are sentences useful?

• So can parse trees be useful?

– Human centred evaluation

– Paragraphs ??

– Whole Documents ???

Conclusions

• IR can benefit from Computational Linguistics Techniques

– But CL research needs to focus on the relevant problems

• CL can benefit greatly from trying to get acceptance in IR

– Focussed task

– Think of statistical MT

Job Ad

• Postdoc positions in multimedia retrieval available in Sunderland

• Search for Sunderland IR Group on the Web

• See:

– http://my.sunderland.ac.uk/web/services/hr/recruitment/

– Search for VITALAS

• Email me:

– [email protected]