
Web-based Information Architectures MSEC 20-760 Mini II

Date post: 05-Jan-2016
Page 1: Web-based Information Architectures MSEC 20-760 Mini II

Web-based Information Architectures
MSEC 20-760 Mini II

Location: GSIA Simon Auditorium
Time: 1:30-3:20pm, Tues. & Thurs.

Instructor: Prof. Jaime Carbonell
Office: NSH 4519
Email: [email protected]
Tel: 268-7279
[Augmented with expert guest lectures]

Teaching assistant: Jian Zhang
Office: NSH 4605
Email: [email protected]
Tel: 268-6521
Office Hours: TBD

Administrative assistant: TBD
Office: NSH 4517
Email: [email protected]
Tel: 268-4788

Page 2: Web-based Information Architectures MSEC 20-760 Mini II

Administrative Issues

Prerequisites
• Basic programming skills (preferably Java)
• Familiarity with the web (HTML, browsing, etc.)
• Fundamentals of Web Programming (20-753)

Grading
• 30% homeworks (2 programming assignments)
• 30% miniproject (student teams will propose)
• 15% midterm (5 pages of notes, calculator OK, no laptops)
• 25% final (10 pages of notes, calculator OK, no laptops)

Bulletin Board
• Schedule/syllabus
• Lecture notes (in PowerPoint)
• Homework
• Announcements & discussions

Page 3: Web-based Information Architectures MSEC 20-760 Mini II

Textbook and Reference Materials (1)

Required: Class notes (slides on web site) and handouts (to be provided)

Required: "Understanding Search Engines: Mathematical Modeling and Text Retrieval" by Michael W. Berry and Murray Browne. Available at http://www.siam.org (tel: 1-800-447-7426)

Optional: Background reading material provided

Page 4: Web-based Information Architectures MSEC 20-760 Mini II

Textbook and Reference Materials (2)

Optional: "Advances in Information Retrieval", edited by Croft, Kluwer Academic Pub., 2000 [more detailed state-of-the-art IR book]

Optional: "Machine Learning" by Tom M. Mitchell, WCB McGraw-Hill [Tools for text categorization and data mining.]

Page 5: Web-based Information Architectures MSEC 20-760 Mini II

Information Retrieval: The Challenge (1)

Text DB includes:

(1) Rainfall measurements in the Sahara continue to show a steady decline starting from the first measurements in 1961. In 1996 only 12mm of rain were recorded in upper Sudan, and 1mm in Southern Algiers...

(2) Dan Marino states that professional football risks losing the number one position in the hearts of fans across this land. Declines in TV audience ratings are cited...

(3) Alarming reductions in precipitation in desert regions are blamed for desert encroachment of previously fertile farmland in Northern Africa. Scientists measured both yearly precipitation and groundwater levels...

Page 6: Web-based Information Architectures MSEC 20-760 Mini II

Information Retrieval: The Challenge (2)

User query states:
"Decline in rainfall and impact on farms near Sahara"

Challenges
• How to retrieve (1) and (3) and not (2)?
• How to rank (3) as best?
• How to cope with no shared words?
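
The "no shared words" challenge can be made concrete with a minimal sketch. It assumes a naive tokenizer and a tiny, hand-picked stopword list (both are illustrative choices, not part of the course material), and uses shortened versions of the three example texts.

```python
import re

# Assumed stopword list, chosen only for this illustration.
STOPWORDS = {"a", "and", "in", "near", "of", "on", "the", "to"}

def content_words(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

query = "Decline in rainfall and impact on farms near Sahara"

# Shortened excerpts of the three example documents.
docs = {
    1: "Rainfall measurements in the Sahara continue to show a steady decline",
    2: "Declines in TV audience ratings are cited",
    3: "Alarming reductions in precipitation in desert regions are blamed "
       "for desert encroachment of previously fertile farmland in Northern Africa",
}

# Content words that each document shares with the query.
shared = {doc_id: content_words(query) & content_words(text)
          for doc_id, text in docs.items()}

for doc_id, words in shared.items():
    print(doc_id, sorted(words))
```

Under these assumptions only document (1) shares content words with the query. Document (3) is on topic but shares none ("farmland" is not "farms", "precipitation" is not "rainfall"), so exact word matching cannot tell it apart from the irrelevant document (2).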

Page 7: Web-based Information Architectures MSEC 20-760 Mini II

Information Retrieval in eCommerce (1)

Bringing in Customers
• How do Web-search engines work?
• How to maximize hits on my eCommerce pages?
• How to maximize preselection of customers who will transact?

Page 8: Web-based Information Architectures MSEC 20-760 Mini II

Information Retrieval in eCommerce (2)

Analyzing the Competition
• How do we find the competition?
• How will customers find the competition?
• Can we do preemptive information strikes?

Text Mining
• How to learn what customers want most?
• How to find out what they missed, but wanted?
• How to discover customer search/browsing patterns?

Page 9: Web-based Information Architectures MSEC 20-760 Mini II

Information Retrieval Assumption (1)

Basic IR task
• There exists a document collection {Dj}
• User enters ad hoc query Q
• Q correctly states user’s interest
• User wants {Di} ⊂ {Dj} most relevant to Q

Page 10: Web-based Information Architectures MSEC 20-760 Mini II

Information Retrieval Assumption (2)

"Shared Bag of Words" assumption
Every query = {wi}
Every document = {wk}
...where wi & wk in same Σ

All syntax is irrelevant (e.g. word order)
All document structure is irrelevant
All meta-information is irrelevant (e.g. author, source, genre)
=> Words suffice for relevance assessment
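
As a minimal sketch of the bag-of-words assumption, the snippet below reduces a text to word counts and discards everything else. The tokenizer and the example sentences are assumptions made for illustration.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Reduce a text to word counts; order and structure are thrown away."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")

# The shared vocabulary (the slide's Σ) is the union of all words seen.
vocabulary = set(a) | set(b)

print(a == b)              # word order is ignored, so the bags are equal
print(sorted(vocabulary))
```

The two sentences mean different things but produce identical bags, which is exactly what "all syntax is irrelevant" implies.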

Page 11: Web-based Information Architectures MSEC 20-760 Mini II

Information Retrieval Assumption (3)

Retrieval by shared words

If Q and Dj share some wi, then Relevant(Q, Dj)

If Q and Dj share all wi, then Relevant(Q, Dj)

If Q and Dj share over K% of wi, then Relevant(Q, Dj)
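
The three criteria above can be sketched as set operations. This assumes whitespace tokenization and measures the K% threshold against the query's words; the function names and example strings are hypothetical.

```python
def words(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())

def share_some(query, doc):
    """Relevant if the query and document share at least one word."""
    return bool(words(query) & words(doc))

def share_all(query, doc):
    """Relevant if every query word appears in the document."""
    return words(query) <= words(doc)

def share_over(query, doc, k):
    """Relevant if more than k percent of the query words appear in the document."""
    q = words(query)
    if not q:
        return False
    return 100.0 * len(q & words(doc)) / len(q) > k

query = "laser printers now"
doc = "laser printers on sale"

print(share_some(query, doc))      # True: "laser" and "printers" are shared
print(share_all(query, doc))       # False: "now" is missing from the document
print(share_over(query, doc, 50))  # True: 2 of 3 query words is about 67%
```

The "some" rule favors recall, the "all" rule favors precision, and the K% rule sits between the two.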

Page 12: Web-based Information Architectures MSEC 20-760 Mini II

Boolean Queries (1)

Industrial use of Silver

Q: silver
R: "The Count’s silver anniversary..."
   "Even the crash of ’87 had a silver lining..."
   "The Lone Ranger lived on in syndication..."
   "Silver dropped to a new low in London..."
   ...

Q: silver AND photography
R: "Posters of Tonto and the Lone Ranger..."
   "The Queen’s Silver Anniversary photos..."
   ...

Page 13: Web-based Information Architectures MSEC 20-760 Mini II

Boolean Queries (2)

Q: ((silver AND (NOT anniversary) AND (NOT lining) AND emulsion)
    OR (AgI AND crystal AND photography))

R: "Silver Iodide Crystals in Photography..."
   "The emulsion was worth its weight in silver..."
   ...
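
A Boolean query like this one maps directly onto set operations over an inverted index. The sketch below assumes a four-document collection invented for illustration (the texts are not the full documents behind the slide) and an exact-match tokenizer.

```python
import re

# Hypothetical mini-collection, made up for this example.
docs = {
    1: "The Count's silver anniversary",
    2: "Even the crash of '87 had a silver lining",
    3: "Silver iodide (AgI) crystal structure in photography",
    4: "The emulsion was worth its weight in silver",
}

def tokenize(text):
    return set(re.findall(r"[a-z0-9']+", text.lower()))

# Inverted index: term -> set of ids of documents containing that term.
index = {}
for doc_id, text in docs.items():
    for word in tokenize(text):
        index.setdefault(word, set()).add(doc_id)

all_ids = set(docs)

def term(t):
    """Posting set for a term (empty if the term is unseen)."""
    return index.get(t.lower(), set())

def NOT(postings):
    """Complement with respect to the whole collection."""
    return all_ids - postings

# AND is set intersection (&), OR is set union (|).
hits = ((term("silver") & NOT(term("anniversary")) & NOT(term("lining")) & term("emulsion"))
        | (term("AgI") & term("crystal") & term("photography")))

print(sorted(hits))  # [3, 4]
```

Exact matching is also why such queries are prone to low recall: a document that says "crystals" rather than "crystal" would not match the second clause.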

Page 14: Web-based Information Architectures MSEC 20-760 Mini II

Boolean Queries (3)

Boolean queries are:

a) easy to implement

b) confusing to compose

c) seldom used (except by librarians)

d) prone to low recall

e) all of the above

Page 15: Web-based Information Architectures MSEC 20-760 Mini II

Beyond the Boolean Boondoggle (1)

Desiderata (1)

•Query must be natural for all users

•Sentence, phrase, or word(s)

•No AND’s, OR’s, NOT’s, ...

•No parentheses (no structure)

•System focus on important words

•Q: I want laser printers now
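
One simple way a system can "focus on important words" is to drop fluff words before matching. The sketch below assumes a small hand-made fluff list; it is illustrative only, and real systems typically use larger lists or term weighting instead.

```python
import re

# Assumed fluff-word list for this example only.
FLUFF_WORDS = {"i", "want", "now", "a", "an", "the", "please", "some"}

def important_words(query):
    """Keep the query's words, in order, minus the fluff words."""
    tokens = re.findall(r"[a-z]+", query.lower())
    return [t for t in tokens if t not in FLUFF_WORDS]

print(important_words("I want laser printers now"))  # ['laser', 'printers']
```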

Page 16: Web-based Information Architectures MSEC 20-760 Mini II

Beyond the Boolean Boondoggle (2)

Desiderata (2)

• Find what I mean, not just what I say

Q: cheap car insurance

(pAND (pOR "cheap" [1.0] "inexpensive" [0.9] "discount" [0.5])
      (pOR "car" [1.0] "auto" [0.8] "automobile" [0.9] "vehicle" [0.5])
      (pOR "insurance" [1.0] "policy" [0.3]))
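
The slide does not define how pAND and pOR combine weights, so the sketch below picks one plausible interpretation purely as an assumption: pOR takes the highest weight among the alternatives that occur in the document, and pAND multiplies the group scores. Other combination rules are equally possible.

```python
# Expanded query from the slide: one weighted synonym group per original word.
QUERY = [
    {"cheap": 1.0, "inexpensive": 0.9, "discount": 0.5},
    {"car": 1.0, "auto": 0.8, "automobile": 0.9, "vehicle": 0.5},
    {"insurance": 1.0, "policy": 0.3},
]

def p_or(group, doc_words):
    """Assumed semantics: best weight among the alternatives present (0 if none)."""
    return max((w for t, w in group.items() if t in doc_words), default=0.0)

def p_and(groups, doc_words):
    """Assumed semantics: product of group scores, so a missing group gives 0."""
    result = 1.0
    for group in groups:
        result *= p_or(group, doc_words)
    return result

def score(text):
    return p_and(QUERY, set(text.lower().split()))

print(score("cheap car insurance"))
print(score("discount auto insurance quotes"))
print(score("cheap car rental"))
```

Under this interpretation a document using only synonyms still matches, at a lower score than one using the user's own words, while a document missing any of the three concepts scores zero.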

Page 17: Web-based Information Architectures MSEC 20-760 Mini II

Beyond the Boolean Boondoggle (3)

Desiderata (3)

•Speech-recognized queries

•Coming soon, to a system near you

•longer queries

•more fluff words to filter

•acoustic recognition errors

Page 18: Web-based Information Architectures MSEC 20-760 Mini II

INFORMATION RETRIEVAL

[Diagram with components: The Web; Library, etc.; Spider; Inverted Index; Search Engine; User]
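
How the components named on this slide might fit together can be sketched end to end. Everything here is an assumption made for illustration: the "web" is an in-memory dictionary with made-up URLs, the spider is a breadth-first traversal, and ranking simply counts matching query words.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/home": ("laser printers on sale", ["a.example/ink", "a.example/about"]),
    "a.example/ink": ("ink for laser and inkjet printers", ["a.example/home"]),
    "a.example/about": ("about our store", []),
    "a.example/orphan": ("laser pointers", []),  # no page links here
}

def spider(start):
    """Breadth-first crawl from start; returns {url: text} for reachable pages."""
    seen, queue, pages = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_inverted_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Rank pages by how many distinct query words they contain."""
    counts = {}
    for word in set(re.findall(r"[a-z]+", query.lower())):
        for url in index.get(word, ()):
            counts[url] = counts.get(url, 0) + 1
    return sorted(counts, key=lambda u: (-counts[u], u))

pages = spider("a.example/home")
index = build_inverted_index(pages)
results = search(index, "laser printers")
print(results)
```

The orphan page is never indexed because no crawled page links to it, which illustrates why pages outside the link graph stay invisible to a spider-based search engine.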

Page 19: Web-based Information Architectures MSEC 20-760 Mini II

INFORMATION RETRIEVAL:APPLICATIONS

• Searching Document Archives
  – Libraries (title, subject, full-text)
  – Databases of patents and applications
  – DBs of legal cases (e.g. Lexis, Westlaw)
• Searching the Web
  – Pure search engines (Google, Inktomi, …)
  – Browsing + Search (Yahoo, Terra-Lycos, …)
  – Meta-search (Metacrawler, Vivisimo, …)
• Corporate or Government Intranets
• Non-traditional (e.g. Software DBs, News)

Page 20: Web-based Information Architectures MSEC 20-760 Mini II

INFORMATION RETRIEVAL (IR) EVOLUTION

• IR in the 1980s:
  – Single collection with < 10^6 documents (archive)
  – Boolean queries with unordered-set answer
• IR circa 2000:
  – Single collection with > 10^9 documents (web)
  – Free-form queries with ranked-list answer
• IR circa 2010:
  – Multiple collections > 10^12 docs (invisible web)
  – “Find what I mean” queries with clustering, summarization and customization.

Page 21: Web-based Information Architectures MSEC 20-760 Mini II

Content for Rest of the Course (1)

[See the course BB for the latest updates to the course schedule.]

Under the Hood
• The vector space model for retrieval
• Building an inverted index
• Term weighting and selection
• Web spidering
• Automated text categorization

Page 22: Web-based Information Architectures MSEC 20-760 Mini II

Content for Rest of the Course (2)

IR Uses in eCommerce
• How to make search engines work for you
• How to build optimal search-attractive web sites
• The business(es) of web-based information

Beyond Web Search Engines
• Speech processing primer
• Information extraction from web pages
• Data mining primer
• Multi-media applications
• Business models

Page 23: Web-based Information Architectures MSEC 20-760 Mini II

Optional Quick Review of Linear Algebra

If you know n-dimensional vectors, matrices, computing inner products, etc., then you do not need this review. You may take a break.

If you learned this material, but do not remember it, please stay and listen to refresh your knowledge.

If you never learned linear algebra, stay, listen and (optionally) read either:

• G. Hadley. Linear Algebra. Addison-Wesley, 1961. Ch. 3.
• Or, Stephen W. Goode. An Introduction to Differential Equations and Linear Algebra. Prentice Hall, 1991. Ch. 3.

