
CS344: Introduction to Artificial Intelligence

Date post: 23-Mar-2016
Page 1: CS344: Introduction to Artificial Intelligence

CS344: Introduction to Artificial Intelligence

Pushpak Bhattacharyya
CSE Dept., IIT Bombay

Lecture 32: Information Retrieval: Basic concepts and Model

Page 2: CS344: Introduction to Artificial Intelligence

The elusive user satisfaction

[Figure: factors behind user satisfaction. Labels: Ranking; Correctness of Query Processing; Coverage; NER; Stemming; MWE; Crawling; Indexing]

Page 3: CS344: Introduction to Artificial Intelligence

Query: Indian Tribes in Latin America

Page 4: CS344: Introduction to Artificial Intelligence

Google

• Indians of Latin America: an exhibition of materials in the Lilly ...
  Lilly Library: Latin American mss. Brazil. A large map in colors, this locates the course of rivers, towns, mountain ranges, and Indian tribes. ...
  www.indiana.edu/~liblilly/etexts/ila/

• Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
  American Indian creation legends tell of a variety of originations of ..... that it had confirmed the presence of 67 different uncontacted tribes in Brazil, ...
  en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas

• Cognition :: Giving Technologies New Meaning
  The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru ... motor vehicles that are lemons · Indian tribes of Latin America ...
  wikipedia.cognition.com/?num=10&from_val...Indian%20tribes%20of%20Latin%20Ame...

• Top 25 American Indian Tribes for the United
  Top 25 American Indian Tribes for the United States: 1990 and 1980--Con. ... 16028 73.0 Canadian and Latin American... 19375 248.3 Chickasaw. ...
  www.census.gov/population/socdemo/race/indian/ailang1.txt

• Ten Largest American Indian Tribes, 2000 - Infoplease.com
  Latin American Indian, 180940. Choctaw, 158774. Sioux, 153360 ... American Indian and Alaska Native Population by Selected Tribes, Census 2000 ...
  www.infoplease.com/ipa/A0767349.html

• The Indian Tribes of North America by John R. Swanton at Questia ...
  Read the complete book The Indian Tribes of North America by becoming a ..... Sao Paulo recently elected its...must cope with demands by Latin America for ...
  www.questia.com/library/book/the-indian-tribes-of-north-america-by-john-r-swanton.jsp

Page 5: CS344: Introduction to Artificial Intelligence

Yahoo

different indian tribes of latin america, More...

WEB RESULTS

• South America Daily
  Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...
  www.wn.com/LatinAmerica

• Native American Indian Cultures - Mexico, South America
  Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...
  indian-cultures.com

• Native American Indian Cultures - links
  North American Tribes. rednation.org - RedNation of the Cherokee. Meso and Latin American Indians ... Human Rights in Latin America ...
  www.indian-cultures.com/Cultures/Links.html

• Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
  ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...
  en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas

• Native American Images - American Indian North America Tribe Map
  American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ...
  www.nativeamericans.com/NativeAmericanImages6.htm

• Resources for Numbers of Native Americans or Indians in Latin America: 39,442,000 million ... Indian Tribes in Latin America - Latin American Indian Population - Up date ...
  www.xmission.com/~amauta/population.htm

• Indian tribe found in Brazil's Amazon - Boston.com
  Latin America/Caribbean. Indian tribe found in Brazil's Amazon ... Uncontacted tribes are usually discovered when loggers and ranchers encroach on ...
  boston.com/news/world/latinamerica/articles/2007/06/01/.../

Page 6: CS344: Introduction to Artificial Intelligence

AltaVista

Latin America
Compare airfare prices from over 120 top websites and save up to 70%.
Flights.SideStep.com

Regional Telecom Statistics & Forecasts
Fixed, mobile, Internet, broadband telecom statistics and forecasts.
www.hottelecom.com

AltaVista found 4,520,000 results

• South America Daily
  Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...
  www.wn.com/LatinAmerica

• Native American Indian Cultures - Mexico, South America
  Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...
  indian-cultures.com

• Indian tribes in Suriname cross borders - Boston.com
  Days of rain near Suriname's southern border have deluged Amerindian farmland, ... Latin America/Caribbean. Indian tribes in Suriname cross borders ...
  www.boston.com/news/world/latinamerica/articles/2006/05/12...in_suriname_cross_borders

• Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
  ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...
  en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas

• Native American Images - American Indian North America Tribe Map
  American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ...
  www.nativeamericans.com/NativeAmericanImages6.htm

Page 7: CS344: Introduction to Artificial Intelligence

MSN

• Native American Images - American Indian North America Tribe Map
  Native American Images American Indian North America Tribe Map Click here to view more images ... History Hotline | Iraqi War | Korean War | Latin Americans ...
  www.nativeamericans.com/NativeAmericanImages6.htm

• Resources for ... 152t.), Panama (126t.), Paraguay (67t.), Surinam (10t.), and Venezuela (331t.) (t.=thousand). - Indian Tribes in Latin America
  www.xmission.com/~amauta/tribes.htm

• Latin America Community Assistance Foundation - LACA
  The Tarahumara Indians are the most primitive of all Indian tribes in North America, and are the least touched by modern society.
  www.lacafoundation.org/?page_id=58

• Latin America Tour Set for Curtis Photos of North America Tribes
  28 September 2005. Latin America Tour Set for Curtis Photos of North America Tribes. Famed photographer recorded Indian tribal life in 19th, early 20th century
  www.america.gov/st/washfile-english/2005/September/20050928134700GLnesnoM0.2225763.html

• Latin America // Current
  Current TV Latin America category, discover popular Latin America stories, news and ... of the Amazon jungle, a land conflict between rice farmers and a handful of Indian tribes ...
  current.com/topics/75844112_latin_america

• Bloomberg.com: Latin America
  May 30 (Bloomberg) -- Brazil's National Indian Foundation has discovered an Indian tribe in the Amazon that hasn't had contact with civilization in a rare sighting of the few ...
  www.bloomberg.com/apps/news?pid=20601086&sid=aSrj5wfHW.CQ&refer=latin_america

Page 8: CS344: Introduction to Artificial Intelligence

Personalized focused search (wikipedia.cognition)

Indian Latin-America tribe: 249 files

• William Curtis Farabee: The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru based on his first trip in 1906-1908 (Obituary, 1925).
• Mexican Texas: Settlers were empowered to create their own militias to help control hostile Indian tribes. Texas faced raids from both the Apache and Comanche tribes, [...]
• Temecula, California: The Luiseño and Cahuilla tribes were involved, rather bloodily, in the local battles of the Mexican-American War during the following years.
• Kaweah Indian Nation: Recently, scam artists have sold purported citizenships in the non-recognized tribe, particularly to Mexican nationals who have entered the US illegally. [...]
• Flag of Puerto Rico: The tribal nation flag of the Jatibonicu Taino Indians of Borikén, represents the Jatibonicu Taino tribe's original pre-Columbian territories of [...]
• Maina Indians: The Maina Indians are a group of tribes constituting a distinct linguistic stock, the [...] along the north bank of the Marañón River in South America
• Erie (tribe): Ebooks by Google: "Handbook of American Indians North of Mexico" By Frederick Webb Hodge http://books.google.com/books?
• Miccosukee: Other members went on to form the Miccosukee Tribe of Indians of Florida, which was not recognized by Fidel Castro's Cuban government in 1959. The [...]
• New Tribes Mission: In Paraguay in 1979 and 1986, New Tribes Mission was accused of assisting in the forcible contact of nomadic Ayoreo Indians.

Page 9: CS344: Introduction to Artificial Intelligence

Example: Semantically precise search for relations/events

Query: afghans destroying opium poppies

Page 10: CS344: Introduction to Artificial Intelligence

India Wide Cross Lingual Information Access (CLIA) Endeavour

Page 11: CS344: Introduction to Artificial Intelligence

Motivation
• English is still the most dominant language on the web
  – Contributes 72% of the content
• Number of non-English users steadily rising all over the world
• English penetration in India
  – Estimated to be around 3-4%
  – Mostly the urban educated class
• Need to enable access to the above information through local languages

Page 12: CS344: Introduction to Artificial Intelligence

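Cross Language Information Retrieval (CLIR)

[Figure: CLIR architecture. Labels: Hindi Query; CLIR Engine; Language Resources; Crawled and Indexed Web Pages; Target Information in English; Target Language Index in English; Ranked List of Results; Result Snippets in Hindi]

Hindi query: तिरूपति यात्रा ("Tirupati travel")

Result snippet in Hindi:
तिरूपति आने के लिए रेल साधन ("Rail options for reaching Tirupati")
तिरूपति पुण्य नगर पहुँचने के लिए बहुत रेल उपलब्ध हैं | अगर मुंबई से यात्रा कर रहे है तो मुंबई-चेन्नई एक्सप्रेस गाड़ी से प्रवास कर सकते है | ("Many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can take the Mumbai-Chennai Express.")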

Page 13: CS344: Introduction to Artificial Intelligence

Challenges involved in CLIA
• Indexing, retrieval and ranking of multilingual documents
• Web data is not clean and regular
  – Different font encodings, some of them proprietary
  – Spelling variations very common
  – Different document encodings
• Language identification needed to invoke appropriate language analyzers
• Involves a number of fundamental NLP research problems like query disambiguation, machine transliteration, named-entity recognition, multi-word recognition

Page 14: CS344: Introduction to Artificial Intelligence

Cross Language Information Access (CLIA) Consortia Project
• Indian Language CLIR Engine under development
  – Input: Six Indian Languages (Hindi, Bengali, Telugu, Tamil, Marathi and Punjabi)
  – Output: Hindi, English and Input Language of Query
  – Domains: Tourism (Current Release)
• Involves 10 academic institutes all over the country: IITs, Indian Statistical Institute, CDAC, Anna University, Jadavpur University
• IIT Bombay: Overall co-ordinator
  – Responsible for Hindi, Marathi language verticals
• Includes full-fledged search features
  – Snippet translation
  – Summary generation
  – Information Extraction

Page 15: CS344: Introduction to Artificial Intelligence

Portal
• Public portal released at http://www.clia.iitb.ac.in/clia-beta-ext/ in September 2009. (Outside IITB)
• Public portal released at http://www.clia.iitb.ac.in:8080/clia-beta-ext/ in September 2009. (Inside IITB)

Page 16: CS344: Introduction to Artificial Intelligence
Page 17: CS344: Introduction to Artificial Intelligence

Recent Press Coverage

Page 18: CS344: Introduction to Artificial Intelligence

Hindustan Times

Page 19: CS344: Introduction to Artificial Intelligence

IR Basics

(mainly from R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Wokingham, UK, 1999, and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.)

Page 20: CS344: Introduction to Artificial Intelligence

Definition of IR Model

An IR model is a quadruple [D, Q, F, R(qi, dj)], where
• D: documents
• Q: queries
• F: framework for modeling documents, queries and their relationships
• R(.,.): ranking function returning a real number expressing the relevance of dj to qi

Page 21: CS344: Introduction to Artificial Intelligence

Index Terms
• Keywords representing a document
• Semantics of the word helps remember the main theme of the document
• Generally nouns
• Assign numerical weights to index terms to indicate their importance

Page 22: CS344: Introduction to Artificial Intelligence

Introduction

[Figure: retrieval overview. Labels: Docs; Information Need; Index Terms; doc; query; match; Ranking]

Page 23: CS344: Introduction to Artificial Intelligence

Classic IR Models - Basic Concepts

• The importance of the index terms is represented by weights associated with them
• Let
  – t be the number of index terms in the system
  – K = {k1, k2, k3, ..., kt} be the set of all index terms
  – ki be an index term
  – dj be a document
  – wij be a weight associated with (ki,dj)
  – wij = 0 indicates that the term does not belong to the doc
  – vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
  – gi(vec(dj)) = wij is a function which returns the weight associated with pair (ki,dj)
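
The notation above can be sketched in a few lines of Python. This is only an illustration: the vocabulary, the helper names `doc_vector` and `g`, and the choice of binary weights are assumptions made for the example, not part of the lecture.

```python
# Hypothetical vocabulary K = {k1, k2, k3}; the terms are made up for illustration.
K = ["tribe", "amazon", "brazil"]

def doc_vector(doc_terms, vocabulary=K):
    """Return vec(dj) = (w1j, ..., wtj) with binary weights (an assumption):
    wij = 1 if index term ki occurs in the document, else 0."""
    present = set(doc_terms)
    return tuple(1 if k in present else 0 for k in vocabulary)

def g(i, vec):
    """gi(vec(dj)) = wij: weight of the i-th index term (1-based, as on the slide)."""
    return vec[i - 1]

d1 = doc_vector(["amazon", "tribe", "tribe"])
print(d1)        # (1, 1, 0)
print(g(3, d1))  # 0, i.e. k3 does not belong to the document
```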

Page 24: CS344: Introduction to Artificial Intelligence

The Boolean Model
• Simple model based on set theory
• Only AND, OR and NOT are used
• Queries specified as boolean expressions
  – precise semantics
  – neat formalism
  – q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent. Thus, wij ∈ {0,1}
• Consider
  – q = ka ∧ (kb ∨ ¬kc)
  – vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  – vec(qcc) = (1,1,0) is a conjunctive component

Page 25: CS344: Introduction to Artificial Intelligence

The Boolean Model

• q = ka ∧ (kb ∨ ¬kc)

• sim(q,dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki, gi(vec(dj)) = gi(vec(qcc)))
  sim(q,dj) = 0 otherwise

[Figure: Venn diagram of Ka, Kb and Kc with the regions (1,1,1), (1,1,0) and (1,0,0) marked]
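
A minimal Python sketch of this matching rule follows, assuming the query is q = ka AND (kb OR NOT kc) and that documents are restricted to the three terms (ka, kb, kc). The function names are illustrative; enumerating all 0/1 tuples is just one simple way to obtain the disjunctive normal form.

```python
from itertools import product

def query(ka, kb, kc):
    """The example query q = ka AND (kb OR NOT kc)."""
    return bool(ka and (kb or not kc))

# Disjunctive normal form: every binary (ka, kb, kc) tuple that satisfies q.
q_dnf = {bits for bits in product((0, 1), repeat=3) if query(*bits)}

def sim(doc_vec, dnf=q_dnf):
    """Return 1 if vec(dj) equals some conjunctive component of q_dnf, else 0."""
    return 1 if tuple(doc_vec) in dnf else 0

print(sorted(q_dnf, reverse=True))  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
print(sim((1, 1, 0)))  # 1: matches a conjunctive component
print(sim((1, 0, 1)))  # 0: no partial credit in the Boolean model
```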

Page 26: CS344: Introduction to Artificial Intelligence

Drawbacks of the Boolean Model

• Retrieval based on binary decision criteria with no notion of partial matching

• No ranking of the documents is provided (absence of a grading scale)

• Information need has to be translated into a Boolean expression which most users find awkward

• The Boolean queries formulated by the users are most often too simplistic

• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

Page 27: CS344: Introduction to Artificial Intelligence

The Vector Model
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• Ranked set of documents provides for better matching

Page 28: CS344: Introduction to Artificial Intelligence

The Vector Model
• Define:
  – wij > 0 whenever ki ∈ dj
  – wiq >= 0 associated with the pair (ki,q)
  – vec(dj) = (w1j, w2j, ..., wtj)
  – vec(q) = (w1q, w2q, ..., wtq)
• In this space, queries and documents are represented as weighted vectors

Page 29: CS344: Introduction to Artificial Intelligence

The Vector Model

• Sim(q,dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)
• Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
• A document is retrieved even if it matches the query terms only partially

[Figure: vectors dj and q in the space of index terms (axes i and j), separated by the angle θ]
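
The cosine formula can be sketched directly in Python. The function name `cosine_sim` and the convention of returning 0.0 for an all-zero vector are assumptions of this sketch; the slide does not say how zero vectors are handled.

```python
import math

def cosine_sim(d, q):
    """sim(q, dj) = (sum_i wij * wiq) / (|dj| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty document or query matches nothing
    return dot / (norm_d * norm_q)

print(round(cosine_sim((1, 0, 1), (1, 1, 1)), 3))  # 0.816: partial match
print(cosine_sim((0, 0, 1), (1, 0, 0)))            # 0.0: no shared terms
```

With non-negative weights the result always falls between 0 and 1, so documents can be sorted by it.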

Page 30: CS344: Introduction to Artificial Intelligence

The Vector Model
• Sim(q,dj) = [Σi wij * wiq] / (|dj| * |q|)
• How to compute the weights wij and wiq?
• A good weight must take into account two effects:
  – quantification of intra-document contents (similarity)
    • tf factor, the term frequency within a document
  – quantification of inter-documents separation (dissimilarity)
    • idf factor, the inverse document frequency
  – wij = tf(i,j) * idf(i)

Page 31: CS344: Introduction to Artificial Intelligence

The Vector Model
• Let
  – N be the total number of docs in the collection
  – ni be the number of docs which contain ki
  – freq(i,j) be the raw frequency of ki within dj
• A normalized tf factor is given by
  – f(i,j) = freq(i,j) / max_l(freq(l,j))
  – where the maximum is computed over all terms which occur within the document dj
• The idf factor is computed as
  – idf(i) = log(N/ni)
  – the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
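
As a rough sketch, the two factors can be computed as below. The toy documents, the function names, the natural logarithm (the slide does not state a base) and the choice of returning 0.0 for a term that occurs in no document are all assumptions of this example.

```python
import math
from collections import Counter

def normalized_tf(doc_tokens):
    """f(i,j) = freq(i,j) / max_l freq(l,j) for every term in one document."""
    freq = Counter(doc_tokens)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {term: count / max_freq for term, count in freq.items()}

def idf(term, docs):
    """idf(i) = log(N / ni); natural log is an assumption."""
    n_i = sum(1 for d in docs if term in d)
    if n_i == 0:
        return 0.0  # assumption: unseen terms carry no weight
    return math.log(len(docs) / n_i)

# Made-up collection of N = 4 tokenized documents.
docs = [["tribe", "amazon", "tribe"], ["amazon", "river"], ["tribe", "census"], ["river"]]
print(normalized_tf(docs[0]))            # {'tribe': 1.0, 'amazon': 0.5}
print(round(idf("tribe", docs), 3))      # 0.693 (in 2 of 4 docs)
print(round(idf("census", docs), 3))     # 1.386 (in 1 of 4 docs, so more informative)
```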

Page 32: CS344: Introduction to Artificial Intelligence

The Vector Model
• The best term-weighting schemes use weights which are given by
  – wij = f(i,j) * log(N/ni)
  – the strategy is called a tf-idf weighting scheme
• For the query term weights, a suggestion is
  – wiq = (0.5 + [0.5 * freq(i,q) / max_l(freq(l,q))]) * log(N/ni)
• The vector model with tf-idf weights is a good ranking strategy with general collections
• The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
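
The two weighting formulas can be sketched as follows. The function names and the example numbers (N = 4, ni = 2) are hypothetical, the natural logarithm is an assumption, and giving weight 0 to a term that is absent from the document or query is a convention chosen for this sketch.

```python
import math
from collections import Counter

def doc_weight(term, doc_tokens, N, n_i):
    """wij = f(i,j) * log(N/ni), with f(i,j) = freq(i,j) / max_l freq(l,j)."""
    freq = Counter(doc_tokens)
    if freq[term] == 0 or n_i == 0:
        return 0.0  # term not in the document (or not in the collection)
    return (freq[term] / max(freq.values())) * math.log(N / n_i)

def query_weight(term, query_tokens, N, n_i):
    """wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N/ni)."""
    freq = Counter(query_tokens)
    if freq[term] == 0 or n_i == 0:
        return 0.0  # assumption: terms absent from the query get weight 0
    return (0.5 + 0.5 * freq[term] / max(freq.values())) * math.log(N / n_i)

# Hypothetical numbers: 4 documents in the collection, "amazon" occurs in 2 of them.
print(round(doc_weight("amazon", ["tribe", "amazon", "tribe"], N=4, n_i=2), 3))  # 0.347
print(round(query_weight("amazon", ["amazon", "tribe"], N=4, n_i=2), 3))         # 0.693
```

The 0.5 offset in the query weight keeps every query term's tf component between 0.5 and 1, so short queries are not dominated by a single repeated word.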

Page 33: CS344: Introduction to Artificial Intelligence

The Vector Model
• Advantages:
  – term-weighting improves quality of the answer set
  – partial matching allows retrieval of docs that approximate the query conditions
  – cosine ranking formula sorts documents according to degree of similarity to the query
• Disadvantages:
  – assumes independence of index terms (??); not clear that this is bad though

Page 34: CS344: Introduction to Artificial Intelligence

The Vector Model: Example I

      k1   k2   k3   q · dj
d1     1    0    1      2
d2     1    0    0      1
d3     0    1    1      2
d4     1    0    0      1
d5     1    1    1      3
d6     1    1    0      2
d7     0    1    0      1

q      1    1    1

[Figure: documents d1 to d7 placed in a Venn diagram of k1, k2 and k3]

Page 35: CS344: Introduction to Artificial Intelligence

The Vector Model: Example II

[Figure: documents d1 to d7 placed in a Venn diagram of k1, k2 and k3]

      k1   k2   k3   q · dj
d1     1    0    1      4
d2     1    0    0      1
d3     0    1    1      5
d4     1    0    0      1
d5     1    1    1      6
d6     1    1    0      3
d7     0    1    0      2

q      1    2    3

Page 36: CS344: Introduction to Artificial Intelligence

The Vector Model: Example III

[Figure: documents d1 to d7 placed in a Venn diagram of k1, k2 and k3]

      k1   k2   k3   q · dj
d1     2    0    1      5
d2     1    0    0      1
d3     0    1    3     11
d4     2    0    0      2
d5     1    2    4     17
d6     1    2    0      5
d7     0    5    0     10

q      1    2    3
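
The last column of Example III can be checked with a short Python sketch. The values listed appear to be the plain inner product q · dj rather than the length-normalized cosine, so that is what this sketch computes; the variable and function names are illustrative.

```python
# Term weights (k1, k2, k3) for each document in Example III.
docs = {
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0),
}
q = (1, 2, 3)

def dot(d, q):
    """Inner product q . dj = sum_i wij * wiq (no length normalization)."""
    return sum(wd * wq for wd, wq in zip(d, q))

scores = {name: dot(vec, q) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(scores)  # {'d1': 5, 'd2': 1, 'd3': 11, 'd4': 2, 'd5': 17, 'd6': 5, 'd7': 10}
print(ranked)  # ['d5', 'd3', 'd7', 'd1', 'd6', 'd4', 'd2']
```

Dividing each score by |dj| * |q| would turn these values into cosine similarities and could change the order, since long documents such as d7 would no longer be favored by their length alone.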

