Searching in All the Right Places

Date post: 30-Dec-2015
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Locating Information on the WWW: Searching for Truth

Lawrence Snyder

Chapter 5

Searching in All the Right Places

• The Obvious and Familiar

– To find tax information, ask the tax office

• Libraries Online

– Many college and public libraries let you access their online catalogs and other information resources

• Libraries provide online facilities that are well organized and trustworthy

• Remember that many pre-1985 documents are not yet available online

• Plus, librarians are real live experts

How Is Information Organized?

• Hierarchical classification (like a family tree)

• Information is grouped into a small number of categories, each of which is easily described (top-level classification)

• Information in each category is divided into subcategories (second-level classifications), and so on

• Eventually the classifications become small enough for you to look through the whole category to find the information you need

– This is a process of elimination as much as choosing appropriate subcategories
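
The narrowing process described on this slide can be sketched in a few lines of Python. The catalog is a made-up example (its category and item names are illustrative assumptions, not taken from any real site), and `narrow` is a hypothetical helper name. Each choice steps down one classification and discards every sibling category.

```python
# Hypothetical catalog: nested dicts are categories; lists hold the
# items left over once a category is small enough to scan by eye.
CATALOG = {
    "Music": {
        "Jazz": ["Kind of Blue", "A Love Supreme"],
        "Folk": ["Blue", "Harvest"],
    },
    "News": {
        "World": ["Election results"],
        "Science": ["Red giant stars explained"],
    },
}


def narrow(tree, choices):
    """Follow one choice per level and return what remains."""
    node = tree
    for choice in choices:
        # Picking one subcategory eliminates all of its siblings.
        node = node[choice]
    return node


items = narrow(CATALOG, ["Music", "Jazz"])
print(items)
```

Two choices reduce the whole catalog to a list short enough to read through in full.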

Important Properties of Classifications

• Descriptive terms must cover all the information in the category and be easy for a searcher to apply

• Subcategories do not all have to use the same classifications

• Information in the category defines how best to classify it

• There is no single way to classify information

How is Web Site Information Organized?

• Homepage is the top-level classification for the whole Web site

• Classifications are the roots of hierarchies that organize large volumes of similar types of information

• Topic clusters are sets of related links

– For example, sidebar and top of page navigation links

• Content information often fills the rest of a page

Alternate Hierarchy Presentations

• Top-level classifications can be expanded individually to show next-level information

• Alternatively, a tabular form of the tree can be presented for a broader picture at a glance (sometimes called a site map)

• Our NPR homepage example offers both forms

Design of Hierarchies

• General rules for design and terminology of hierarchies

– Root is usually at the top (branching metaphor)

• "Going up in the hierarchy" means the classifications become more inclusive or general

• "Going down in the hierarchy" means the classifications become more specific or detailed

• The greater-than (>) symbol is a common way to show going down through levels of classification

Levels in a Hierarchy

• A one-level hierarchy has only one level of "branching"—no subdirectories

• To count levels, remember

– There is always a root

– There are always "leaves": the categories themselves

– The root and leaves do not count as levels

• The NPR hierarchy, drawn as a tree, shows 2 classification levels between the root (homepage) and the leaves (content pages)
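
The counting rule above can be written as a short recursive function. This is a sketch under stated assumptions: the tree is a hypothetical two-level site, not the actual NPR structure. Nested dicts stand for classifications and lists stand for leaves. When categories differ in depth, the function reports the deepest one.

```python
def count_levels(node):
    """Count classification levels below this node.

    A dict is a branching point whose keys form one level.
    A list holds leaves, which do not count as a level.
    """
    if not isinstance(node, dict) or not node:
        return 0
    return 1 + max(count_levels(child) for child in node.values())


# Hypothetical site: homepage (root) -> topic -> subtopic -> content pages.
SITE = {
    "News": {"World": ["story-1"], "Science": ["story-2"]},
    "Music": {"Browse Artists": ["artist-1"]},
}

# One level of branching only: root -> category -> leaves.
FLAT = {"A": ["x"], "B": ["y"]}

print(count_levels(SITE), count_levels(FLAT))
```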

Other Hierarchy Considerations

• Groupings may overlap (one item can appear in more than one category), or be partitioned (every item appears in exactly one category)

• Number of levels may differ by category, even in the same hierarchical tree

• A single path from root to leaf is a full classification of the leaf content

– Home > Music > Browse Artists > C > Cave Singers

• “Tree of Life” biological taxonomy for humans is a path
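
The root-to-leaf paths described on this slide can be generated mechanically. The sketch below assumes a hypothetical site tree (nested dicts for classifications, lists for leaves) and a hypothetical helper name, `classification_paths`. It joins each path with the greater-than symbol.

```python
def classification_paths(node, prefix=("Home",)):
    """Yield one 'A > B > C' string per leaf in the hierarchy."""
    if isinstance(node, dict):
        for name, child in node.items():
            # Going down one level appends one classification to the path.
            yield from classification_paths(child, prefix + (name,))
    else:
        for leaf in node:
            yield " > ".join(prefix + (leaf,))


# Hypothetical tree; the two branches have different numbers of levels.
SITE = {
    "Music": {"Browse Artists": {"C": ["Cave Singers"]}},
    "News": ["Headlines"],
}

paths = list(classification_paths(SITE))
for path in paths:
    print(path)
```

The two printed paths differ in length, which illustrates that the number of levels may differ by category within one tree.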

Searching the Web for Information

• Individual web sites are carefully organized by their designers (hierarchically, for example)

• But… no one organizes the entire Web, and it has grown unimaginably HUGE… too huge to just browse looking for specific items

• Search engines solve the problem

• Popular Search Engines: Google, Yahoo!, MSN, AOL, Ask

Search Engine Basics

• A search engine has two basic parts

– Crawler: Constantly runs, visits sites on the Internet, discovering Web pages and building/updating an index to the content it finds

– Query processor: Looks up user-submitted keywords in the index and reports back a list of Web pages the crawler has found containing those words

Crawlers

• Build index telling which Web pages (URLs) contain which words, based on their HTML text.

• When a crawler visits a web page it:

1. Adds all tokens (words) on the page into the index (words from the title, the body content, anchor text, META tags)

2. Associates the URL for the page with each of these words

3. Then visits all pages that are linked to the page being examined, and does steps 1 through 3 on each

• A crawler can miss a page

– If no other page links to it

– If the page is created dynamically, on the fly

– If the page contains only images or is of an unknown type (not HTML, PDF, etc.)
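
The three crawler steps can be sketched against a tiny in-memory "web". Everything here is an assumption made for illustration: the URLs, the page texts, and the `crawl` helper are hypothetical, and a real crawler would fetch and parse HTML over the network. The `seen` set is an addition not mentioned on the slide. It keeps the crawl from looping forever when pages link to each other.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/home": ("red fish blue fish", ["a.example/guppy"]),
    "a.example/guppy": ("blue guppy care", ["a.example/home"]),
    "a.example/orphan": ("red guppy secrets", []),  # nothing links here
}


def crawl(start_url, web):
    """Build an index mapping each word to the set of URLs containing it."""
    index = defaultdict(set)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        # Steps 1 and 2: record every token together with this page's URL.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
        # Step 3: visit linked pages, skipping ones already scheduled.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl("a.example/home", WEB)
print(sorted(index["guppy"]))
```

The orphan page also contains "guppy", but it never enters the index because no crawled page points to it.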

Query Processors

• User submits one or more keywords (the query)

• Index is consulted for these keywords, producing a list of web page URLs found by the crawler

• Important to give a good query to get a useful list of pages in reply

• A query that is not specific produces a huge list of unwanted URLs

Multiword Index Searches

• List of several keywords is an AND-query

– red fish blue guppy

• Very common (default) use, means each found page must contain ALL the words

• Look in the index for the URL list for each word, then scan the lists for URLs common to all

• URL lists in indexes are often alphabetized to make this faster
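
A minimal sketch of why alphabetized URL lists help: two sorted lists can be intersected in a single pass, advancing whichever side currently holds the smaller URL. The index contents and the function names `intersect_sorted` and `and_query` are assumptions made for illustration.

```python
def intersect_sorted(a, b):
    """Return the URLs present in both alphabetized lists."""
    common = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return common


def and_query(index, words):
    """Intersect the URL lists of every keyword in the query."""
    lists = [index.get(word, []) for word in words]
    if not lists:
        return []
    result = lists[0]
    for urls in lists[1:]:
        result = intersect_sorted(result, urls)
    return result


# Hypothetical index: word -> alphabetized list of URLs.
INDEX = {
    "red": ["p1.example", "p3.example", "p4.example"],
    "fish": ["p1.example", "p2.example", "p4.example"],
    "blue": ["p1.example", "p4.example", "p5.example"],
    "guppy": ["p4.example"],
}

hits = and_query(INDEX, ["red", "fish", "blue", "guppy"])
print(hits)
```

Because both lists are sorted, neither is ever rescanned. The cost of each intersection grows with the combined length of the two lists rather than with their product.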

Advanced Searches

• Search engines allow complex queries to get a smaller, more useful list of returned pages (see Google’s Advanced Search page)

• Logical operators

– AND: Tells search engine to return only pages containing all terms

red AND fish AND blue AND guppy

– OR: Finds pages containing any word given, including pages where 2 or more appear

marshmallow OR strawberry OR chocolate

– NOT/-: Excludes pages containing the given word

– Combinations:

• tigers AND NOT baseball

• (chocolate OR strawberry) AND sundae

• Simpson bart OR lisa OR maggie -homer -marge

– Use parentheses to make your intent clear
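
The logical operators map naturally onto set operations: AND is intersection, OR is union, and NOT is set difference. The sketch below uses a made-up index with hypothetical page identifiers. It shows how an index can be combined, not how any particular search engine parses queries.

```python
# Hypothetical index: word -> set of pages containing it.
INDEX = {
    "chocolate": {"p1", "p2"},
    "strawberry": {"p2", "p3"},
    "sundae": {"p2", "p3", "p4"},
    "tigers": {"p5", "p6"},
    "baseball": {"p6"},
}


def pages(word):
    """Look up a word, treating unknown words as matching nothing."""
    return INDEX.get(word, set())


# (chocolate OR strawberry) AND sundae
sundaes = (pages("chocolate") | pages("strawberry")) & pages("sundae")

# tigers AND NOT baseball
big_cats = pages("tigers") - pages("baseball")

# A different grouping of the same three words gives a different answer.
regrouped = pages("chocolate") | (pages("strawberry") & pages("sundae"))

print(sorted(sundaes), sorted(big_cats), sorted(regrouped))
```

The first and third results differ only in where the parentheses sit, which is why stating the grouping explicitly matters.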

Effective Queries: Narrow the Hitlist

• Suppose you are writing a report on red giant stars. You issue the Google query

red giant

and get 4.9 million hits (URLs)… now what?

• First few pages deal with software and rock bands, so try again… with some restrictions

red giant -software -music

• Now we have 824,000 hits… better, still big, but the early URLs are somehow the ones we need

Ordering the Hits

• How did Google decide on the order for the 824,000 URLs, and put the “best” ones up top?

• Order is determined by relevance, which is judged in several ways

• Top URLs have “red giant” together on the page, in the order given in the query

• Later down the list are pages with “giant red”, or words separated, or words in anchor text

• Google enhances this with a relevance score called PageRank for each page
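
The ordering described above (exact phrase first, then reordered or separated words) can be imitated with a toy scoring function. This is only an illustration: the score values 3, 2, and 1 are arbitrary, the pages are invented, anchor text is ignored, and nothing here reflects how Google actually computes relevance.

```python
def phrase_score(query, text):
    """Score a page by how closely it matches the query phrase."""
    words = query.lower().split()
    tokens = text.lower().split()
    n = len(words)
    # Every run of n consecutive tokens on the page.
    windows = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    if words in windows:
        return 3  # words adjacent and in query order
    if any(sorted(window) == sorted(words) for window in windows):
        return 2  # words adjacent but reordered
    if all(word in tokens for word in words):
        return 1  # all words present but separated
    return 0


# Hypothetical pages.
PAGES = {
    "misc.example": "a red barn beside a giant oak",
    "garden.example": "the giant red tomato won a prize",
    "stars.example": "a red giant is a dying star",
}

ranked = sorted(
    PAGES,
    key=lambda url: phrase_score("red giant", PAGES[url]),
    reverse=True,
)
print(ranked)
```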

PageRank

• Count the links into a page (“important” pages are pointed to by lots of other pages)

– Each page that links to a target page is considered a "vote" for that target page

• If the "voting page" is itself highly ranked, this ups the PageRank for the target page

• Words in anchor text up the PageRank

• Crawler computes this as it indexes

• Complete details of the PageRank algorithm are Google proprietary information
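
The "voting" idea can be illustrated with the basic, publicly described formulation of PageRank, in which each page repeatedly passes a share of its rank along its outgoing links. This is a simplified sketch, not Google's production algorithm. The four-page link graph, the damping value of 0.85 (a commonly cited default), and the fixed iteration count are all assumptions, and the anchor-text effect mentioned above is left out.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate ranks for a dict of page -> outgoing links.

    Assumes every link target also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # A page with no links spreads its vote over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
            else:
                # A highly ranked voter passes on a larger share.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a, b, and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c collects the most votes and ranks first. Page a has only one inbound link, but it comes from the highly ranked c, so a still ends up well ahead of b and d.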

Further Constraining Search

• Some web sites offer search limited only to the pages of that site

• We can often focus a search by limiting it to URLs in specific domains (like .gov or .edu) or to specific sites (like www.youtube.com).

• This lets Google’s PageRank do a good job of ordering the hits we get from the restricted domains
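
Restricting hits to a domain or site amounts to filtering on each URL's host name. The sketch below uses Python's standard `urllib.parse.urlparse`. The list of hits and the `restrict` helper are hypothetical, and search engines usually expose this kind of filter through their own query syntax rather than through code like this.

```python
from urllib.parse import urlparse


def restrict(urls, suffix):
    """Keep URLs whose host equals the suffix or ends with '.' + suffix."""
    suffix = suffix.lstrip(".").lower()
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Compare whole labels so "notgov.example" does not match "gov".
        if host == suffix or host.endswith("." + suffix):
            kept.append(url)
    return kept


# Hypothetical hit list.
HITS = [
    "https://www.whitehouse.gov/briefing-room/",
    "https://www.whitehouse.com/",
    "https://astro.example.edu/red-giants",
    "https://www.youtube.com/watch?v=abc",
    "https://notgov.example/gov",
]

print(restrict(HITS, ".gov"))
print(restrict(HITS, "www.youtube.com"))
```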

Web Information: Truth or Fiction?

• Anyone can publish anything on the web

– Note prevalence of blogs and wikis

• Some of what gets published is false, misleading, deceptive, self-serving, slanderous, or disgusting

– If it is on the web it must be true. – NOT!

• How do we know if the pages we find in our search are reliable?

Do Not Assume Too Much

• Registered domain names may be misleading or deliberate hoaxes

– www.whitehouse.gov vs. www.whitehouse.org vs. www.whitehouse.com

• Look for who or what organization publishes the Web page

– Respected organizations publish the best information

• A two-step check for the site's publisher

1. InterNIC (www.internic.net/whois.html) provides the name of the company that assigned the site's IP address, and a link to the WhoIs server maintained by that company

2. Go to the WhoIs Server site and type the domain name or IP address again.

– Information returned is the owner's name and physical address

Characteristics of Legitimate Sites

• Web sites are most believable if they have these features:

– Physical Existence: Site provides a street address, phone number, e-mail address

– Expertise: Site includes references, citations or credentials, related links

– Clarity: Site is well organized, easy to use, and has site-searching facilities

– Currency: Site was recently updated

– Professionalism: Site's grammar, spelling, and punctuation are correct; all links work

Check and Double Check

• Remember that a site can have many of the features of legitimacy and still not be authoritative.

– Example: http://www.dhmo.org/ (hoax about the dangers of dihydrogen monoxide, that is, water, H2O)

• Use known authoritative sites to cross check, or consult respected debunkers (like snopes.com)

• When in doubt, check it out. Ask a librarian.

• Test your assessment skills… check out the Burmese Mountain Dog web page

Summary

• Libraries are excellent primary resource tools

• Large libraries have extensive online resources

• Libraries not only provide information digitally, they also connect us with “pre-digital” archives: the millions of books, journals, and manuscripts that still exist only in paper form

• We need software and our own intelligence to search the Internet effectively

• We create search queries using the logical operators AND, OR, NOT, and specific terms to pinpoint the information we seek

• Once we’ve found information, we must judge whether it is correct by investigating the organization that publishes the page, including checking the credentials of the people who write the content.

• We must cross-check the information with other sources, especially when the information is important

