Lecture08

Date post: 29-Jan-2015
Upload: rishi-gupta
Description:
web engineering
Transcript
Page 1: Lecture08

WWW Search Engines

CSC1720 – Introduction to Internet

Essential Materials

Page 2: Lecture08

All copyrights reserved by C.C. Cheung 2003.

Outline

Introduction
Directories, Search Engines, Metasearch Engines
Search Fundamentals
Search Strategies
How does a search engine work?
Searching Tips
Your site’s ranking?
Summary

Page 3: Lecture08

Introduction

You have probably been using search engines, but perhaps not as effectively as possible.

A lot of information is available online, but not all of it is completely accurate.

Web-page addresses change constantly, so a page may be available for only a short time.

Page 4: Lecture08

Search Engine History

In 1990, before the WWW, Alan Emtage created Archie, the first search tool for finding files on FTP sites.

In 1993, Veronica was developed, followed by Jughead, Wandex, …

In 1994, Galaxy, WebCrawler, Yahoo! and Lycos debuted.

In 1995 and afterwards: Excite, Infoseek, Alta Vista, MetaCrawler, …

Next generation: specialized hybrids

Page 5: Lecture08

Directories

A Web Directory or Web Guide is a hierarchical representation of hyperlinks.

The top level is typically a wide range of very general topics.

Each topic contains hyperlinks to more specialized sub-topics.

Very easy to use.

Page 6: Lecture08

Hierarchical Representation

Page 7: Lecture08

Popular Directories

AOL anywhere – search.aol.com
CNET Search.com – www.search.com
Excite – www.excite.com
E-Wild life – www.ewildlife.com
Lycos – www.lycos.com
Yahoo! – www.yahoo.com
Google – www.google.com

Page 8: Lecture08

Some figures

Page 9: Lecture08

Search Engines

A search engine is a computer program that does the following:
– Allows the user to submit a query that consists of a word or phrase
– Searches the database
– Returns a list of suitable URLs which match the query
– Allows the user to revise and resubmit the query

Page 10: Lecture08

Where to submit a query?

Submit your Query

Page 11: Lecture08

Popular Search Engines

AOL anywhere – search.aol.com
AltaVista – altavista.digital.com
Excite – www.excite.com
HotBot – www.hotbot.com
Magellan – www.mckinley.com
Google – www.google.com

Page 12: Lecture08

Metasearch Engines

A metasearch or all-in-one search engine performs a search by using more than one other search engine to complete the search job.

Duplicate retrievals are eliminated.
The results are ranked according to how well they match the query.

Advantage:
– A single query can access a lot of search engines.

Disadvantage:
– A high noise-to-signal ratio: a lot of the matches will not be suitable for you.
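The merge step of a metasearch engine can be sketched as follows. This is only an illustration: the result lists, the scores, and the function name `merge_results` are hypothetical, and a real metasearch engine would have to reconcile scores that different engines compute on different scales.

```python
def merge_results(*result_lists):
    """Merge (url, score) lists from several engines, dropping duplicate URLs."""
    best = {}
    for results in result_lists:
        for url, score in results:
            key = url.rstrip("/")  # treat ".../kayak" and ".../kayak/" as the same hit
            if key not in best or score > best[key][1]:
                best[key] = (url, score)
    # Rank the surviving hits by score, best match first.
    return sorted(best.values(), key=lambda item: item[1], reverse=True)


# Hypothetical results returned by two underlying engines.
engine_a = [("http://example.com/kayak", 90), ("http://example.com/alaska", 40)]
engine_b = [("http://example.com/kayak/", 75), ("http://example.org/rental", 60)]

merged = merge_results(engine_a, engine_b)
```

The duplicate hit is kept once, with the higher of its two scores.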

Page 13: Lecture08

Popular Metasearch Engines

Metasearch – www.metasearch.com
Metacrawler – www.metacrawler.com
MetaFind – www.metafind.com
Dogpile – www.dogpile.com

Page 14: Lecture08

Some Figures

Page 15: Lecture08

White Pages / Yellow Pages

White pages allow users to look up information about individuals.
We can use white pages to track down telephone numbers and email addresses.
People can abuse white pages.
Some people think that white pages are an invasion of their privacy.
Yellow pages contain information about businesses.

Page 16: Lecture08

Popular White Pages & Yellow Pages

Bigfoot – www.bigfoot.com
Yahoo! People Search – people.yahoo.com
WhoWhere – www.whowhere.com
Yahoo! Yellow Page – yp.yahoo.com
SuperPages – www.superpages.com

Page 17: Lecture08

Some Figures – White Pages

Page 18: Lecture08

Some Figures – Yellow Pages

Page 19: Lecture08

Comparison

Directory: A directory allows you to explore and eventually get what you want.
Search Engine: A search engine brings you to the exact page containing the words or phrases you are looking for.

Directory: Use a directory to find cooking-related websites.
Search Engine: Use a search engine to find a specific recipe by providing the names of the ingredients.

Directory: Use a directory to find travel guides for a country.
Search Engine: Use a search engine to find the train schedule in South Africa.

Page 20: Lecture08

Search Fundamentals

Example: www.yahoo.com
Header: Yahoo logo and some advertising.
Information bar: contains other hyperlinks.
Search form area: consists of a form which allows you to type a query.
Directory area: a large number of categories and channels.
Yahoo Links: links to other Yahoo sites.
Footer: contains information about Yahoo, copyright and a disclaimer.

Page 21: Lecture08

Search Fundamentals

Page 22: Lecture08

Search Terminology

Search Tool: Any means of locating information on the Internet.
Query: Information typed into the form on the search engine.
Query syntax: Rules for constructing a valid query.
Query semantics: Rules for defining the meaning of a query.
Hit/Match: A URL that the search engine returns for a specific query.
Relevancy score: A value that indicates the quality of the URL (how closely it matches the query, from 1 to 100).

Page 23: Lecture08

Pattern Matching Queries

It is also called a Fuzzy Query.
You can enter “ungrammatical sentences”, “incomplete sentence fragments”, “disjoint phrases”, “nonsense words”.
The search engine gets a collection of keywords.
Required keyword: mark with “+” before the keyword.
Prohibited keyword: mark with “-” before the keyword.
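The “+” and “-” markers can be illustrated with a small sketch. The function names `parse_query` and `matches` are hypothetical, the matching is plain substring matching, and real engines differ in how they tokenize and score such queries.

```python
import shlex


def parse_query(query):
    """Split a +/- query into required, prohibited and optional keywords."""
    required, prohibited, optional = set(), set(), set()
    # shlex keeps a quoted phrase such as +"Prince William Sound" as one token.
    for token in shlex.split(query.lower()):
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            prohibited.add(token[1:])
        else:
            optional.add(token)
    return required, prohibited, optional


def matches(text, query):
    """True if text contains every required keyword and no prohibited one."""
    required, prohibited, _optional = parse_query(query)
    text = text.lower()
    return (all(keyword in text for keyword in required)
            and not any(keyword in text for keyword in prohibited))


page = "Kayaking trips in Prince William Sound, Alaska"
hit = matches(page, '+"Prince William Sound" +Alaska')
miss = matches(page, "+Alaska -kayaking")
```

Optional keywords are parsed but not used for filtering here; an engine would typically use them only to adjust the ranking.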

Page 24: Lecture08

Pattern Matching Queries

Page 25: Lecture08

Boolean Queries

A Boolean Query is a query that consists of keywords combined with logical operators (AND, OR, NOT).

X AND Y – will return URLs that contain both X and Y.

X OR Y – will return URLs that contain either X or Y.

X AND NOT Y – will return URLs that contain X and do not contain Y.

Symbols: AND - &, OR - |, NOT - !, NEAR - ~

Page 26: Lecture08

Boolean Queries

AND is used for narrowing a query
– If you know that your target documents will contain a group of keywords, list them using the AND operator

OR is used for broadening a query
– If you can think of related words for a topic, list them using the OR operator

NOT is used to redirect a query
– If you find that a keyword or phrase is leading to irrelevant hits, then represent it in your query as AND NOT keyword
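The narrowing and broadening effect of the operators can be seen with Python sets. The index below is a made-up toy (keyword mapped to the set of pages containing it), not data from any real engine.

```python
# Hypothetical toy index: keyword -> set of pages that contain it.
index = {
    "kayak": {"a.html", "b.html"},
    "alaska": {"b.html", "c.html"},
    "rental": {"c.html"},
}


def urls(keyword):
    """Pages containing the keyword (empty set if the keyword is unknown)."""
    return index.get(keyword, set())


both = urls("kayak") & urls("alaska")      # kayak AND alaska: narrows
either = urls("kayak") | urls("alaska")    # kayak OR alaska: broadens
without = urls("alaska") - urls("rental")  # alaska AND NOT rental: redirects
```

AND maps to set intersection, OR to union, and AND NOT to set difference, which is why AND can only shrink a result and OR can only grow it.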

Page 27: Lecture08

Boolean Queries

Page 28: Lecture08

Using Wildcards

Wildcards are useful for retrieving variations of a word
For example, art* will search for art, artwork, artist, artistry, and so forth
An excellent way to broaden a search
Different wildcard characters are used by different search engines
The most common characters are: *, #, and ?
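A rough way to see what art* expands to is Python's standard `fnmatch` module, which uses shell-style wildcards. The vocabulary list is invented for the example, and a search engine's own wildcard rules may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary of indexed words.
vocabulary = ["art", "artwork", "artist", "artistry", "article", "cart", "smart"]


def expand(pattern, words):
    """Return the words that match a shell-style wildcard pattern."""
    return [word for word in words if fnmatchcase(word, pattern)]


expanded = expand("art*", vocabulary)
```

Note that "article" also matches: a wildcard broadens the search, but it can pull in unrelated words as well.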

Page 29: Lecture08

Advanced Search Options

Page 30: Lecture08

Break Time – 10 minutes

Page 31: Lecture08

Search Strategies

You should find a search engine that meets the following conditions:
– A user-friendly interface
– Easy-to-understand documentation
– Convenient to access
– A large indexed database
– Assigns good relevancy scores

Learn the syntax of this one search engine thoroughly, rather than the syntax of several different engines.

Page 32: Lecture08

Search Generalization

Too few hits?
– You need to generalize your search query.

Pattern matching query: eliminate one of the more specific keywords from the query.

Boolean query: remove keywords joined with the AND operator, delete the NOT item, or use the OR operator.

Use a directory or metasearch engine if you still cannot locate a matching URL.

Page 33: Lecture08

Search Specialization

Too many hits?
– You need to specialize your search query.

Pattern matching query: add more keywords.

Boolean query: use AND with another keyword, or add the NOT operator to exclude some unwanted pages.

Try capitalizing proper nouns or names.
Use a directory to locate your information.

Page 34: Lecture08

Sample Searches

Queries about kayaking in Alaska
Example: using Infoseek

Query                                                   No. of Hits
alaska                                                  176,954
Alaska                                                  176,064
+“Prince William Sound” +Alaska                         778
+kayak +“Prince William Sound” +Alaska                  44
+kayaking +“Prince William Sound” +Alaska               60
+kayaking +“Prince William Sound” +Alaska +rental       20

Page 35: Lecture08

How does it work?

User Interface – Allows you to type a query and displays the results.

Searcher – The engine searches the database for pages matching your query.

Evaluator – The engine assigns scores to the retrieved information.

Gatherer – The component that travels the Web and collects information.

Indexer – The engine that categorizes the data collected by the gatherer.

Page 36: Lecture08

User Interface

Provides a mechanism for a user to submit queries to the search engine.
Uses forms, very user friendly.
The user interface displays the search results in a convenient way.
A summary of each matched page is shown.

Page 37: Lecture08

Searcher

It is a program that uses the search engine’s database to locate the matches for a specific query.

The database of a search engine holds an extremely large number of indexed pages.

A highly efficient search algorithm is necessary.

Computer scientists have spent years developing searching and sorting methods.

You can refer to computer science books for details.

Page 38: Lecture08

Evaluator

The searcher returns a set of URLs that match your query.

Not all of the hits match your query equally well.

The more references there are to a page, the higher its ranking will be.

How is the relevancy score calculated?
– It varies from one engine to another.
– How many times does the word appear?
– Do the query words appear in the title?
– Do the query words appear in the META tag?
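A toy evaluator that combines the three questions above might look like this. The weights (5 for the title, 3 for META keywords, 1 for the body) and the function name `relevancy_score` are assumptions made up for illustration; every engine uses its own formula, and the result here is not scaled to 1 to 100.

```python
def relevancy_score(query_words, title, meta_keywords, body,
                    title_weight=5, meta_weight=3, body_weight=1):
    """Toy score: weighted count of query words in title, META keywords and body."""
    title_words = title.lower().split()
    meta_words = [keyword.strip().lower() for keyword in meta_keywords]
    body_words = body.lower().split()
    score = 0
    for word in (w.lower() for w in query_words):
        score += title_weight * title_words.count(word)
        score += meta_weight * meta_words.count(word)
        score += body_weight * body_words.count(word)
    return score


# Hypothetical page: "kayak" scores 5 + 3 + 2, "alaska" scores 5 + 0 + 1.
score = relevancy_score(
    ["kayak", "alaska"],
    title="Alaska kayak rental",
    meta_keywords=["kayak", "rental"],
    body="rent a kayak and paddle a kayak in alaska",
)
```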

Page 39: Lecture08

Link Popularity

reference

Page 40: Lecture08

Gatherer

It is a program that traverses the Web and gathers information about Web documents.

It runs at short, regular intervals.

The information it returns is indexed into the database.

Alternate names: Bot, Crawler, Robot, Spider and Worm.
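The traversal can be sketched without touching the network by using a small in-memory "web". The URLs, page texts, and the `gather` function are all hypothetical; a real gatherer would fetch pages over HTTP and extract the links from the HTML.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://a.example/": ("home page", ["http://a.example/b", "http://a.example/c"]),
    "http://a.example/b": ("about kayaks", ["http://a.example/"]),
    "http://a.example/c": ("about alaska", ["http://a.example/missing"]),
    "http://a.example/orphan": ("no page links here", []),
}


def gather(seed):
    """Breadth-first traversal from seed; returns {url: text} for reachable pages."""
    seen = {seed}
    queue = deque([seed])
    collected = {}
    while queue:
        url = queue.popleft()
        page = WEB.get(url)
        if page is None:  # dead link: nothing to collect
            continue
        text, links = page
        collected[url] = text
        for link in links:
            if link not in seen:  # visit each URL at most once
                seen.add(link)
                queue.append(link)
    return collected


pages = gather("http://a.example/")
```

The page at /orphan is never collected, because no visited page links to it.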

Page 41: Lecture08

Spiderlist

http://www.spiderhunter.com/

Page 42: Lecture08

Indexer

It organizes the data by creating a set of keys or an index.
– E.g. libraries index by Author, Title, ISBN, etc.
Indexes need to be rebuilt frequently.
– This ensures that the returned URLs are not out of date.
The search engine is very complex and needs to be broken down into different components.
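One common way to build such a set of keys is an inverted index, which maps each word to the pages that contain it. The sketch below is a minimal illustration with made-up pages; the slides do not say which index structure any particular engine uses.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            # Strip surrounding punctuation so "Alaska." and "alaska" share a key.
            index[word.strip(".,;:!?\"'()")].add(url)
    index.pop("", None)  # discard tokens that were only punctuation
    return dict(index)


# Hypothetical pages collected by the gatherer.
pages = {
    "a.html": "Kayak rental in Alaska.",
    "b.html": "Alaska travel guide",
}
index = build_index(pages)
```

Looking up a keyword is then a single dictionary access, for example `index["alaska"]`.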

Page 43: Lecture08

Case Study - AltaVista

Sends out crawlers (robot programs) that capture information from the Web and bring it back.
The main crawler, “Scooter”, simultaneously sends out HTTP requests like blind users on the Web.
All of this information is stored in the indexing engine.
Scooter’s cousins help to remove “dead” links.
On a typical day, Scooter will visit over 10 million pages.
Web pages that no links reference will never be found.
You can also submit your URL to AltaVista.

Page 44: Lecture08

Case Study - AltaVista

METAtags – special keywords embedded in the headers of the webpage.

Full-text index – Every word on every page is also included during searching.

AltaVista uses full-text indexing.
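To show where METAtag keywords live in a page, here is a small sketch that reads them with Python's standard `html.parser` module. The sample HTML and the class name `MetaKeywordParser` are invented for the example; this is not how AltaVista itself was implemented.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the keywords listed in <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]


# Hypothetical page with keywords in its header.
html = """<html><head>
<title>Kayak rental</title>
<meta name="keywords" content="kayak, Alaska, rental">
</head><body>Rent a kayak.</body></html>"""

parser = MetaKeywordParser()
parser.feed(html)
keywords = parser.keywords
```

A full-text indexer would additionally index every word of the body, not only these keywords.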

Page 45: Lecture08

METAtag Example

Page 46: Lecture08

Case Study - AltaVista

Limit a search to a domain
E.g. searching the “edu” domain
+domain:edu +“molecular biophysics”
The above query would only search for molecular biophysics at educational institutions.
Here is a list of Top-level Internet Domains
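The effect of a domain restriction can be imitated by filtering hits on the host name of each URL. The hit list and the helper `in_domain` are hypothetical; the sketch only illustrates the idea behind domain:edu and says nothing about AltaVista's internals.

```python
from urllib.parse import urlsplit


def in_domain(url, domain):
    """True if the URL's host is the domain itself or lies under it."""
    host = (urlsplit(url).hostname or "").lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


# Hypothetical hits for "molecular biophysics".
hits = [
    "http://www.biophysics.example.edu/molecular.html",
    "http://www.example.com/molecular-biophysics",
    "http://education.example.org/",
]
edu_hits = [url for url in hits if in_domain(url, "edu")]
```

Matching on ".edu" at the end of the host name, rather than on "edu" anywhere in the URL, keeps education.example.org out of the result.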

Page 47: Lecture08

Searching Tips

Be natural
– Is cell phone harmful?
– Ask the search engine: “Cell phone” AND harmful

Capitalize
– Always use lowercase
– star will search “Star, STAR, stAr, …”
– Type “Star” only if you really want to search for exactly “Star”.

Page 48: Lecture08

Searching Tips

Use uncommon keywords
– More specific results will be returned to you.
– Think of a valid and uncommon keyword.

Require words
– Add a “+” before the keyword.
– It will be in every match.

Exclude words
– Use “-” before the keyword.
– In what situation should we use it?

Page 49: Lecture08

Searching Tips

Correct spelling
– Beware of the differences between British and American spellings (color, colour): search for (color OR colour)

Stop words
– Search engines ignore the most common words: “the, is, …”
– For “searching the web”, the search engine will ignore “the web”.
– Add more relevant keywords.
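Stop-word removal can be sketched in a few lines. The stop-word list below is invented for the example (it includes "web" only to mirror the slide's example); real engines keep their own, usually longer, lists.

```python
# Hypothetical stop-word list; "web" is included to mirror the example above.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "web"}


def strip_stop_words(query):
    """Return the query words that survive stop-word removal."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


kept = strip_stop_words("searching the web")
kept_question = strip_stop_words("is the cell phone harmful")
```

Only "searching" survives from the first query, which is why adding a more relevant keyword helps.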

Page 50: Lecture08

Searching Tips

Use wildcards
– Use “*” in some search engines.
– “funk*” matches funk, funky, funkiest, …

Solve dead links
– Suppose the search engine returns http://www.hit.com/a/b/c.html, which is a dead link.
– You can try http://www.hit.com/a/b/
– Or http://www.hit.com/a/ …
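The dead-link trick of trimming the URL one level at a time can be written as a short helper. The function name `parent_urls` is made up for this sketch; it only builds the candidate URLs and does not check whether they are alive.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """List the progressively shorter URLs to try when url is a dead link."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    candidates = []
    while segments:
        segments.pop()  # drop the last path component
        path = "/" + "/".join(segments) + ("/" if segments else "")
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


candidates = parent_urls("http://www.hit.com/a/b/c.html")
```

The last candidate is the site's home page, which is usually the best place to restart the search from.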

Page 51: Lecture08

Factors That Affect Your Site’s Ranking

Keyword prominence
Keyword frequency
Keyword weight
Keyword proximity
Keyword placement
Click popularity & Stickiness

Page 52: Lecture08

Keyword Prominence

How early in a web site do the keywords first appear?
– The title tag is one of the first elements in an HTML document.
– What happens if your title is:
    “This is my homepage”
    “Welcome to my company’s homepage”

Include the keywords in the head, the META tags, early in the body, …
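As a rough illustration of prominence, the sketch below reports the word position at which a keyword first appears in a title. The function name `first_keyword_position`, the keyword "loans" and the third title are assumptions made up for the example; real engines score prominence in their own ways.

```python
import re


def first_keyword_position(text, keyword):
    """Return the 1-based word position of the keyword's first appearance, or None."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    try:
        return words.index(keyword.lower()) + 1
    except ValueError:
        return None


titles = [
    "This is my homepage",
    "Welcome to my company's homepage",
    "Home loans from Smith Brothers Inc",  # hypothetical keyword-bearing title
]
for title in titles:
    print(title, "->", first_keyword_position(title, "loans"))
```

The two generic titles never mention the keyword at all, so they contribute nothing to prominence for that search.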

Page 53: Lecture08

Keyword Frequency

A search engine may determine your site’s popularity by checking how frequently the keyword or phrase appears on the page.

What is the problem if you put the same keyword into a single page too many times?

Page 54: Lecture08

Keyword Weight

It is also called keyword density.

It is measured by comparing the number of times the keyword appears on the web page with the total number of words on the page.

In most cases, we try not to exceed a keyword weight of 3 to 10 percent.
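The calculation can be sketched as keyword occurrences divided by total words, expressed as a percentage. `keyword_weight` and the sample page text are hypothetical, and the sketch handles single-word keywords only.

```python
import re


def keyword_weight(text, keyword):
    """Return keyword occurrences as a percentage of all words on the page."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)


# Made-up page text: 15 words, 3 of them "puppy".
page = "Puppy food for every puppy. Our puppy food is fresh and our prices are fair."
weight = keyword_weight(page, "puppy")
print(f"{weight:.1f}%")
print("over the 10 percent guideline" if weight > 10 else "within the guideline")
```

A page like this one, where the keyword makes up a fifth of the text, is well above the 3 to 10 percent range mentioned above.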

Page 55: Lecture08

Keyword Density

Page 56: Lecture08

Keyword Proximity

“Keyword proximity” measures how close the keywords are to each other on a web page.

For a search on “home loans”, a page that mentions “home loans” will outrank one that mentions “home mortgage loans”.

E.g.
– Smith Brothers Inc has been selling puppy food for over 50 years.
– Smith Brothers Inc has been selling food for your puppies for over 50 years.
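One simple way to put a number on proximity is to count the words that sit between the two keywords. The sketch below does this for the two example sentences; `keyword_gap` is a hypothetical helper, and matching on the prefixes "pupp" and "food" is a crude stand-in for proper stemming so that "puppy" and "puppies" both count.

```python
import re


def keyword_gap(text, first, second):
    """Return the smallest number of words between the two keywords, or None if one is missing.

    Keywords are matched as word prefixes, a crude substitute for stemming.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w.startswith(first)]
    second_positions = [i for i, w in enumerate(words) if w.startswith(second)]
    if not first_positions or not second_positions:
        return None
    return min(abs(i - j) for i in first_positions for j in second_positions) - 1


close = "Smith Brothers Inc has been selling puppy food for over 50 years."
far = "Smith Brothers Inc has been selling food for your puppies for over 50 years."
print(keyword_gap(close, "pupp", "food"))  # keywords are adjacent
print(keyword_gap(far, "pupp", "food"))    # two words in between
```

The first sentence keeps the keywords adjacent, so under this measure it has the better proximity for a "puppy food" search.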

Page 57: Lecture08

Keyword Placement

Search engines favor web sites that contain keywords in:
– The title tag
– The keyword META tag
– The headline tag <H1> …
– The first 25 words of the body
– Hyperlinks
– Image ALT attributes
– Text near the end of the document
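A small sketch, using Python's standard `html.parser`, that reports in which of these places a keyword phrase shows up. The class `PlacementScanner` and the sample page are hypothetical, and only the title, keyword META tag, `<H1>`, hyperlink text and image ALT text are checked; the "first 25 words" and "end of document" rules are left out.

```python
from html.parser import HTMLParser


class PlacementScanner(HTMLParser):
    """Record which of a few page locations contain the keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.open_tags = []   # text-bearing tags we are currently inside
        self.found = set()    # locations where the keyword was seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            if self.keyword in (attrs.get("content") or "").lower():
                self.found.add("meta keywords")
        elif tag == "img":
            if self.keyword in (attrs.get("alt") or "").lower():
                self.found.add("img alt")
        elif tag in ("title", "h1", "a"):
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        if self.keyword in data.lower():
            for tag in self.open_tags:
                self.found.add(tag)


# Made-up page for illustration.
PAGE = """<html><head><title>Puppy food from Smith Brothers</title>
<meta name="keywords" content="puppy food, dog food"></head>
<body><h1>Welcome</h1><img src="bag.jpg" alt="Bag of puppy food">
<p>See our <a href="/prices.html">price list</a>.</p></body></html>"""

scanner = PlacementScanner("puppy food")
scanner.feed(PAGE)
print(sorted(scanner.found))
```

In this sample the headline and the link text miss the keyword, which is the kind of gap the list above suggests fixing.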

Page 58: Lecture08

Click popularity & Stickiness

Click popularity is a measure of the number of clicks received by each site in a search engine's results page.

Stickiness is a measure of the amount of time a user spends at a site. It's calculated according to the time that elapses between each of the user's clicks on the search engine's results page.

Reference: http://www.directhit.com/
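Both measures can be sketched from a click log. The log format, the site names and the two function names below are assumptions for illustration (one user, one results page); they are not Direct Hit's actual method.

```python
from collections import Counter, defaultdict

# Hypothetical click log for one results page: (seconds since the search, site clicked).
CLICKS = [(5, "a.example"), (20, "b.example"), (200, "a.example"), (260, "c.example")]


def click_popularity(clicks):
    """Count how many clicks each site received."""
    return Counter(site for _, site in clicks)


def stickiness(clicks):
    """Estimate seconds spent at each site as the time until the user's next click."""
    time_spent = defaultdict(int)
    ordered = sorted(clicks)
    for (start, site), (next_start, _) in zip(ordered, ordered[1:]):
        time_spent[site] += next_start - start
    return dict(time_spent)


print(click_popularity(CLICKS))
print(stickiness(CLICKS))
```

The last click has no following click, so this sketch cannot estimate the time spent there; "a.example" gets the most clicks, while "b.example" holds the user longest.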

Page 59: Lecture08

Submit your site to search engines

Google – 5 pages/day, Excite – 25 pages/week

Page 60: Lecture08

Summary

Use different resources to search for different kinds of information.

Use successive query refinement to arrive at effective search queries.

Think carefully about the keywords you type into the search engine.

Use Boolean queries when you need combinations of keywords.

Think carefully when you create your own homepage: can it be easily indexed by search engines?

Page 61: Lecture08


The End. Thank you for your patience!

