
Electronic Commerce

School of Library and Information Science

Web server administration

I. Being a webmaster

• Some tasks

• What do they really do?

II. Administering a web site

• Ecommerce infrastructure

• Analyzing server logs

III. Information security

• Security issues in ecommerce

• Technical security: SSL, Firewalls, REP


What tasks must be accomplished in the design, development, and management of a successful web site?

Webmasters have to manage a range of tasks

Content Creation

Architectural Design

Implementation

Visual Design

Management


Content creation

A key to a successful web site is finding the right person to provide meaningful, useful, and well-written content

The information you present must be such that random web surfers will actually choose to return to your site because what you provide is helpful or entertaining

Selling or promoting should be a side effect to the real reason a potential customer is browsing your pages

Web site content design balances on a fine line between public service and marketing.

Every web site should have a web content developer


The content developer should be a great writer

This person should be granted editorial privilege over all web content

There should be standards and templates so that the “feel” of the content remains stable if the person quits or outsources content development

Web content must capture the “spirit” of the company, person, or topic

The best web sites are summations of companies or people or topics


Information architecture

When the content is in place, the next problem is how to present that content on the web.

A web or information architect is responsible for designing the work flow of the site

They will typically be good at meta-vision, flow charts, and navigation templates

They will be regular web surfers who seek out and analyze new navigation metaphors and strategies constantly

The design will be developed and prototyped, tested, and modified


Architectural design questions:

How deep? How wide?

How would frames affect navigation?

When is a hierarchical data structure appropriate?

When is an information cloud more efficient?

How many pages must the average user navigate before she gets to the data she requires?

The web architect must have a good knowledge of site content, a sense about how the content is used, and an idea of how it all fits together

They should watch users navigate the site over time and think of new ways to organize the data to facilitate browsing and use


Implementation

With content and architecture defined, the next step is to make it all web accessible

This requires a set of HTML pages and a web server to distribute those pages

Web technicians and web site administrators make this happen

A web technician is the person responsible for changing content into HTML documents

A web site administrator is responsible for installing, maintaining, troubleshooting, and providing security for web server hardware and software


A good web technician can develop and document clear site-wide coding standards

Good code is the foundation of the web site

A good technician can write code that is standardized and easy to read so that a newly hired technician could acclimatize in a week

This ensures continuity and that maintenance and modifications are smooth and cost efficient

The technician should also understand HTML and related content distribution technologies like CGI, Java, Real Audio/Video and Shockwave

This allows her to choose correctly between the many options for many different types of situations.


The web site administrator should know

UNIX or NT server administration,

TCP/IP

Traditional services like Telnet, Email and FTP

Web security issues

Attention to detail and a firm grasp of the technologies is essential because it is a well-known fact that computer systems carefully choose the TTF (time to fail)

The web site administrator should be able to understand the needs of non-technical users and content providers and be able to explain technical issues in plain talk


Visual Design

Sites need a visual designer who is responsible for logos, icons, navigation buttons, site-wide color standards, site-wide type face standards, side bars, menus, etc....

This person will be fluent in such applications as Adobe Photoshop, DeBabelizer, or Corel Draw as well as all the filters and tools for each

They will also be trained in the quirks and specifics of web graphics design as opposed to print graphics design

They should also be an avid web surfer who is always looking for new presentation tricks


Management

Most large sites create a position just for managing all the network resources.

A web site manager will make sure that communication lines are quick, efficient, and open

This involves opening lines of communications outside the department

This might mean working closely with the ad/marketing department

Web site managers are facilitators and should not rule the web with an iron fist

What is crucial is that the manager knows how to bring out the best of each member of the team and create the glue to bind each part to the whole


“No GUI tool will ever write well-designed and documented code. In fact, I recommend that for the next few years, all web technicians stick to simple text editors and learn how to write all their code by hand. This assures that when they do use GUI tools, they will be using those tools instead of being used by them.”

Selena Sol “What is a Webmaster” http://WDVL.com/Internet/Web/Jobs/webmaster.html


So what do they really do?

Nicole Collins ([email protected]) managed (2001) the company intranet

Her duties included:

1. Meeting with department contacts to continue development of their sites

2. Coding HTML

3. Working with graphic designers to develop “home page” graphics for each department that are also web friendly

4. Training identified content owners to use Web conversion tools, such as Word Internet Assistant, to convert their own documents to HTML


5. Creating graphics for lower level pages

6. Meeting with Intranet Steering Committee once a month to determine where the Intranet is going

7. Heading a monthly Web Developers group for the Internet Services group and division webmasters

8. Keeping up to date on web technologies

9. Working with Internet Services Group within IT to develop databases on the Intranet

10. Working with team to market the Intranet through promotional items

11. Delivering presentations to outside visitors, etc. about our company Intranet


12. Using Adobe Acrobat to deliver forms through the Intranet

13. Writing technical user guides, etc.

14. Keeping all departments up-to-date on what they can do on the web and where their departments should go to upload and manage their pages


Jason Hoch was program coordinator and web developer for MCNC MEMS Technology Applications Center http://mems.mcnc.org

“The real driver for the success of our Web site is not the HTML or graphic design but rather the creation of valuable content and the ease of finding it fast.

For example, we've added a ‘virtual cleanroom tour,’ downloadable design rules, QuickTime movies, FAQs and online order forms over the past few years.

As a result, we have loyal customers who can figure out our processes on their own using the above-mentioned tools.”


His responsibilities (2001) included:

1. Design/Maintain the MEMS Web site

2. Develop marketing-focused content

3. Create graphic design to support interactive content

4. Code HTML

5. Create online order forms using CGI

6. Design rule conversion to Adobe PDF

7. Administer a Unix-based server

8. Interact with staff members/companies

9. Collaborate with engineers to get creative ideas for Web site


10. Interact with companies to contribute technical and creative input for MEMSTechNet

11. Create marketing and advertising brochures that promote the MCNC Web site

12. Help other groups with HTML

13. Update corporate Web site

14. Work with IS on intranet announcements, HR postings, etc.

15. Develop sites for an MCNC spinoff company and a worldwide technology conference

16. Maintain a password-protected site that supports workgroup-type applications with scientists and researchers at the National Academy of Sciences


Mark Polakow was “webmaster” for Telegroup International Inc., a half-billion-dollar telephone reseller

His responsibilities included:

1. Designing and maintaining Telegroup's Intranet:

Conferring with all the department heads to gather ideas on the content they want on the web

Implementing technologies to interact with databases and render information in HTML

Employee training/handbook

Customer records/invoices

Phone lists and other databases


2. Implement a security directory with password and digital encryption information for secured documents

3. Implement technologies such as conferencing and newsgroups

4. Train personnel in the construction and uploading of web pages, file transfer and use of necessary web apps

Train in image acquisition using scanners and imaging tools such as Photoshop

5. Edit all final HTML submissions for visual congruity and proper coding

6. Run and maintain all Web servers

7. Keep current in all technologies and web developments


Web server administration

I. Being a webmaster

• Some tasks

• What do they really do?

II. Administering a web site

• Ecommerce infrastructure

• Analyzing server logs

III. Information security

• Security issues in ecommerce

• Technical security: SSL, Firewalls, REP


II. Administering a web site

• Ecommerce infrastructure

Server platforms

Different types of servers perform different functions

Physical server

A box (like ebiz)

Applications server

Software that resides on the physical server

Provides the “business logic” for relevant applications

Part of a client-server architecture, passing data from the client to the back end and results back to the client


Ecommerce infrastructure

Applications server

They translate raw data from a database into meaningful information displayed in a browser

They handle load balancing by distributing the computational workload

They run parts of applications that users share and communicate between desktop and back-end systems

NetDynamics (Sun), Oracle WAS, Domino (Lotus)

Levinson, M. (2000). What is an Application Server? Darwin. http://www.darwinmag.com/learn/curve/column.html?ArticleID=7


Ecommerce infrastructure

Commerce server

Handles the functions involved in transactions

Product display, online ordering, inventory management

Works with online payment systems

Provides

Encryption: typically SSL

Integrity: the data do not change in transmission

Authentication: they know it’s your site

Non-repudiation: you sent them what you thought you sent them


Ecommerce infrastructure

Database server

Handles DB management tasks

It monitors a port on a server and handles all incoming requests for the underlying database data

The requests are typically SQL

It processes the SQL statement(s) and takes some action based on the statement

This could be data retrieval, data insertion or deletion, or a security modification request from the database administrator (DBA)

Various. (1996). Java Developers’ Reference: What is a database server? http://sunsite.iisc.ernet.in/virlib/java/devref/ch25.htm#WhatIsaDatabaseServer
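The SQL-processing step described above can be sketched with Python’s embedded sqlite3 module (an assumption for illustration; a real database server would receive these statements over a network port rather than in-process):

```python
import sqlite3

# Sketch of what a database server does with incoming SQL statements.
# sqlite3 is embedded rather than networked, but the processing step is
# the same: parse the SQL, act on it, return any rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES ('widget', 9.99)")  # data insertion

# Data retrieval: run the SELECT and hand the rows back to the requester
rows = conn.execute("SELECT name, price FROM products").fetchall()
print(rows)  # -> [('widget', 9.99)]
```

The table and product names are hypothetical examples.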


Ecommerce infrastructure

Disk server

This is a box with one or more hard disks that can be accessed by one or more workstations

Can be used for redundancy and load sharing

Fax server

Manages fax traffic into and out of the network

Allows faxing from the workstation

File server

Allows remote access to shared files and applications

Mail server

Sends, receives, stores, and manages email using SMTP, POP (Post Office Protocol), and MIME


Ecommerce infrastructure

Proxy server

Software that is an intermediary between a workstation and the net

Provides security and admin control

Part of a gateway and/or firewall server

Caches previously viewed pages

Web server

Stores, retrieves and transmits files and software when requested through HTTP

Can create and send dynamic pages in response to user input (using CGI to pass data)

Generates server logs


• Collecting and using server statistics

If you run a server, you need to know who is accessing your pages

Web servers write logging information to files in the logs directory

access_log This log tells you who’s been there and where they went

agent_log What type of browser they used

referrer_log Who is linking to your pages

error_log What went wrong


Every access is logged with

Client address

Time and date of access

Documents requested

Number of bytes transferred

Errors (server and script)

These data can be analyzed by hand or with log analysis tools


An entry in the access_log looks like this:

host rfc931 username [date/time] request status bytes

host DNS name or IP # of remote client

rfc931 identd-provided information about the user [“-” if none]

username UserID sent by client [“-” if none]

date/time Date/time of access in 24hr. local time

request URL requested (with filepath) surrounded in “ ”

status The status code of the server’s response*

bytes Number of bytes transferred [“-” if none]
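These fields can be pulled apart with a short script. A sketch, assuming the server writes the standard Common Log Format laid out above (the sample line is invented):

```python
import re

# One capture group per field listed above:
# host rfc931 username [date/time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfc931>\S+) (?P<username>\S+) '
    r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_entry(line):
    """Return the fields of one access_log line as a dict, or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_entry(
    '152.2.81.1 - - [01/Oct/2001:13:55:36 -0500] '
    '"GET /index.html HTTP/1.0" 200 2326'
)
print(entry["host"], entry["request"], entry["status"])
```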

An entry in the error_log looks like this:

[date/time] An error message of some sort


Those pesky status codes

*status The status code of the server’s response

200 The request was successfully filled

302 URL redirected to another document

400 Bad request made by the client

403 Access to this document is forbidden

404 Document not found

500 Internal server error

501 Application method not implemented (GET or POST)

503 Server out of resources
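The table above, restated as a lookup, with a coarse classifier that groups codes by their leading digit:

```python
# Status codes from the table above, plus a classifier by leading digit.
STATUS_TEXT = {
    200: "The request was successfully filled",
    302: "URL redirected to another document",
    400: "Bad request made by the client",
    403: "Access to this document is forbidden",
    404: "Document not found",
    500: "Internal server error",
    501: "Application method not implemented",
    503: "Server out of resources",
}

def status_class(code):
    """2xx success, 3xx redirect, 4xx client error, 5xx server error."""
    return {2: "success", 3: "redirect",
            4: "client error", 5: "server error"}[code // 100]

print(STATUS_TEXT[404], "->", status_class(404))
```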


Using these logs

You can get a reasonable sense of the number of visitors from the access_log

Place all images in an /images directory

Use a UNIX command to remove them from the access_log file:

egrep -v '/images/' access_log | wc -l

This will leave you with the number of times your documents have been downloaded

You can gather data on the aggregate demand for individual documents on the site
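The egrep pipeline above can also be restated in Python, so the same filter can feed further analysis (the sample lines are invented):

```python
# Count access_log lines that are not image requests, mirroring
# egrep -v '/images/' access_log | wc -l
def count_page_hits(log_lines):
    return sum(1 for line in log_lines if "/images/" not in line)

sample_log = [
    'host - - [01/Oct/2001:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 512',
    'host - - [01/Oct/2001:10:00:01 -0500] "GET /images/logo.gif HTTP/1.0" 200 4096',
    'host - - [01/Oct/2001:10:00:05 -0500] "GET /about.html HTTP/1.0" 200 1024',
]
print(count_page_hits(sample_log))  # -> 2
```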


It is possible to trace the individual’s path through the site since the access_log is chronological

How do they get in?

Which pages are the entry pages?

Where do they go?

How long do they stay?

Where do they not go?

How do they get there?

What navigation paths do they take?


Using the agent_log

This log tells you the type of browser the visitor was using

When you go to a site, your client passes information about itself to the server

A typical entry looks like:

Mozilla/3.1 (Windows;I;32bit)

A more complex entry might look like one of these:

Mozilla/3.1N (X11; I; SunOS 4.1.3_U1 sun4m)

Mozilla/3.1N (X11; I; AIX 2)

Mozilla/3.1N (Macintosh; I; PPC)

Mozilla/3.1N (Macintosh; I; 68K)

Mozilla/3.1N (X11; I; IRIX 5.2 IP22)

These are all versions of the same browser


To make things more confusing, if a request passes through a proxy server, the proxy server adds its identification onto the end of the version string

Here are some more Netscape 7.1 accesses:

Mozilla/7.1 (Windows; I; 32bit) via proxy gateway CERN-HTTPD/3.0 libwww/2.17

Mozilla/7.1 (Windows; I; 32bit) via proxy gateway CERN-HTTPD/3.0 libwww/2.17 via proxy gateway CERN-HTTPD/3.0 libwww/2.17

These entries will vary widely for the thirty or so different browsers that are currently in use

It is possible to aggregate this list and sort by browser

For details see:

http://members.aol.com/htmlguru/agent_log.html
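One way to do that aggregation, sketched with the standard library’s Counter (it keys on the product token before the first slash; the Lynx sample string is an invented example alongside the Mozilla entries above):

```python
from collections import Counter

def browser_family(agent_string):
    """Take the product token before the first '/', e.g. 'Mozilla'."""
    return agent_string.split("/")[0].strip()

agent_log = [
    "Mozilla/3.1 (Windows;I;32bit)",
    "Mozilla/3.1N (X11; I; SunOS 4.1.3_U1 sun4m)",
    "Lynx/2.8.1 libwww-FM/2.14",
]
counts = Counter(browser_family(a) for a in agent_log)
print(counts.most_common())  # -> [('Mozilla', 2), ('Lynx', 1)]
```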


Using the referrer_log

This log tells you where they were when they clicked through to your page

This is useful information because it helps you figure out how many sites have linked to you

The data includes the URL of the page currently displayed by the browser when it connected to your site

This URL, called the “referring page,” gets written to the referrer log along with the document requested from your site


An entry in the referrer_log looks like this:

http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html

http://webcrawler.com/cgi-bin/WebQuery -> /images.html

file:///Hard%20Drive/System%20Folder/Preferences/Netscape/Bookmarks.html

file://localhost/usr/users/la/lav6/public_html/.lynx_bookmarks.html -> /transparent_images.html

file:///I|/HTML/Referenc.htm -> /about_html.html

The URL to the left is the referring page

The path to the right is the document requested while that page was being viewed
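Since each entry uses that “referring page -> document” shape, tallying your top referrers takes only a few lines (the sample entries restate the examples above):

```python
from collections import Counter

referrer_log = [
    "http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html",
    "http://webcrawler.com/cgi-bin/WebQuery -> /images.html",
    "http://webcrawler.com/cgi-bin/WebQuery -> /transparent_images.html",
]

# Split each entry on the arrow and count the referring pages
referrers = Counter()
for entry in referrer_log:
    if " -> " in entry:
        referring_page, requested = entry.split(" -> ", 1)
        referrers[referring_page] += 1

print(referrers.most_common(1))  # the page sending the most visitors
```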


How do they get to you?

Multiple accesses from the same page indicate a link (~5 in a monthly log)

<5 accesses may indicate that they use a bookmark

They may have also typed in your URL from another source

They found you with a query if cgi-bin and a search engine URL are in the entry

They used a locally cached page to get to you if you find file://localhost/usr/users or file://Hard%20Drive/ in the entry

If you don’t find many of your own pages in the log, people are showing up and leaving without looking around


Web server administration

I. Being a webmaster

• Some tasks

• What do they really do?

II. Administering a web site

• Ecommerce infrastructure

• Analyzing server logs

III. Information security

• Security issues in ecommerce

• Technical security: SSL, Firewalls, REP


III. Information security (infosec)

Digital information can easily be compromised

Infosec is made up of techniques and procedures used to protect information and prevent its unauthorized use

These may include: interception, disclosure, alteration, substitution, or destruction of data

Ex: attack through “network sniffing” and filtering devices

These monitor the packets that move through the net

This allows authentication packets to be intercepted

If “root” access is sniffed out, the bad guys are in

Even if lightly encrypted, this is not a challenge for a sophisticated hacker


Major security issues in ecommerce

Internal network security (75% of attacks are internal)

Continued external hacking

Social engineering attacks (information warfare)

Malicious code (in applets etc)

Reliability and performance problems

Denial of service attacks (brute force attacks)

Lack of skills to properly implement and maintain security systems


Infosec services

Confidentiality: conceal data from unauthorized parties

Integrity: assure that the data is genuine

Authentication: user is the user

Data origin authentication: the data source is the data source

Data integrity: The data is the data

Non-repudiation: the transaction occurred the way all parties thought it did (re: participants)

Availability: efficient functioning after security measures are in place

How do these services play a role in ecommerce?


Protection is needed for:

Credit card and personal information transactions

Conditions of security:

The information must be inaccessible to unauthorized parties

It cannot be altered during transmission

The receiver must be able to know it came from the sender

The sender has to know that the receiver is genuine

The sender cannot deny that she sent it

It must be protected when on the server


It’s also needed for:

Virtual private networks

A secure channel (tunnel) is set up on the public network

It allows two systems to use EDI through the tunnel

A high volume of data is exchanged, and the parties at either end are well known to each other

Proprietary methods of encryption and authentication can be used

These networks are currently vulnerable to denial of service attacks

One problem is the instability of the network


And for:

Digital certification

This involves the use of “trusted third parties” who will hold and verify digital certificates

They will authenticate users

They will also vouch for the integrity of data


Technical security: Secure Sockets Layer (SSL)

SSL provides a relatively secure means to encrypt and send data over a public network

SSL 3.0 has been around since March 1996

Netscape submitted it to W3C as a standard

It is an open and non-proprietary standard

It is supported by major server companies (Netscape, Microsoft, Apache)

SSL offers the core components needed to transmit sensitive data securely and to the appropriate person

It offers authentication at both the client and server sides, encryption, and integrity.


SSL uses both public and secret key cryptography for authentication

Authentication begins when a client requests a connection to an SSL server

The client sends its public key to the server, which generates a random message and sends it back to the client

The client uses its private key to encrypt the message from the server and sends it back

The server decrypts the message and compares it to the original one sent to the client

If the messages match, then the server knows that it’s talking to the correct client


Once the client has been authenticated, the server sends out the all-important session key

This is used to encrypt and decrypt all communications between the two machines for the duration of the session

Many secret key algorithms can be used for the session key (Data Encryption Standard (DES), RSA's RC4, or the IDEA algorithm)

Most browsers support at least 40-bit RC4 encryption

Some (including Navigator 4.x and Internet Explorer 4.x) can support DES and up to 128-bit RC4.
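In Python’s standard library this machinery lives in the ssl module; a default context turns on the server authentication described above, and wrapping a socket with it would perform the handshake and session-key negotiation (the hostname in the comment is a placeholder):

```python
import ssl

# A default context requires and verifies the server's certificate
# (authentication) before any session key is exchanged.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # -> True
print(context.check_hostname)                    # -> True

# context.wrap_socket(sock, server_hostname="shop.example.com") would then
# run the handshake and encrypt everything sent for the session.
```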


SSL can also be used to confirm the integrity of a message

SSL does this by using an MD5 message authentication code (MAC) scheme.

By using a MAC, the server can compare the message digest with its own digest of the message sent

If the two message digests are the same, then the stream has not been tampered with

Otherwise, the server can notify the client that its data stream has been corrupted and request that the client resend the data
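A sketch of that digest comparison using HMAC-MD5 from the standard library (SSL 3.0’s own MAC construction differs in detail, but the check is the same: digest the message under a shared secret and compare digests at the receiving end; the key and messages are invented):

```python
import hashlib
import hmac

secret = b"session-secret"        # shared during the SSL handshake
message = b"GET /checkout.html"

# Sender computes a MAC over the message...
sent_mac = hmac.new(secret, message, hashlib.md5).digest()

# ...and the receiver recomputes it over what actually arrived
received_mac = hmac.new(secret, message, hashlib.md5).digest()
print(hmac.compare_digest(sent_mac, received_mac))  # -> True: intact

# A tampered stream produces a different digest
tampered_mac = hmac.new(secret, b"GET /admin.html", hashlib.md5).digest()
print(hmac.compare_digest(sent_mac, tampered_mac))  # -> False: corrupted
```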


SSL is used in conjunction with browsers and commerce servers to provide secure credit card transactions

An icon (and blue bar) lets you know when you are interacting with a secure server

It slows the transaction down because of encryption, decryption, and authentication

The security of the data is ensured when it moves from the client to the server


Implementing SSL in your Web server is a relatively easy task

The basic steps include

Generating a key pair on your server

Requesting a certificate from a certification authority

Installing the certificate, and

Activating SSL on a security folder or directory

It’s not a good idea to activate SSL on all your directories because the encryption overhead it adds can significantly decrease your response times.


Security on a commerce server

A server certificate proves the server’s identity and lets it exchange encrypted information with browsers

It is a unique “distinguished name” of the server which identifies it to visitors

It contains the server's public key

It has a Certification Authority’s digital signature and is validated by the CA

It is installed in the web server software

The digital signature proves the identity of the signer and verifies the contents of the document

The CA digitally signs the distinguished name and public key portions of the certificate and anyone trusting the CA knows that the certificate has not been changed


Public key cryptography is used to exchange bulk encryption keys and verify digital signatures

The server has public and private keys

The public key allows secure communication with the server

The server certificate contains the public key

It is sent to a browser that wants information from the server

The browser uses the public key to encrypt a password that is used to encrypt the rest of the communication

Only the server’s private key can decrypt something encrypted with its public key

The server is the only one that can successfully decrypt the password


Firewalls

The purpose of a firewall is to protect critical digital data from outside attack

It also allows legitimate users internet access

The best firewall is a standalone web server

With the move to link the server to corporate databases, this is not feasible

Types of firewalls

Packet filtering

Proxy server

Stateful firewall system


Background

IP (Internet protocol) has two functions

To deliver packets

To fragment and reassemble packets

IP has a “protocol type identifier” that allows other protocols to run on top

TCP (IP type 6) runs on top and handles error checking and resending

This slows traffic because it takes time for TCP to work

Other services have their own TCP ports

Telnet --> port 23, SMTP --> port 25, HTTP --> port 80


Clients access the server through a “high” port (>1024) on their own machine

UDP (IP type 17) is “user datagram protocol”

This is a faster protocol, but at the expense of error checking

It is used for streaming audio and video, which can be “lossy”


Packet filtering firewall

This is located in a router on the external border of a network

It is the simplest to implement

The router checks a list of rules when it receives an internal or external request

The rules allow and restrict access based on source, destination, and type (IP protocol type, TCP, UDP port #)

Ex: Allow all SMTP email to the server --> any IP address can send email to TCP port 25

Ex: Block telnet requests to the server --> all requests to TCP port 23 are blocked


Typical rules would involve allowing traffic to all ports below 1024 used by server services and blocking access to all the rest

Also allowing access to ports above 1024 unless used by an internal service (ex: Microsoft’s SQL Server uses port 1433)

The weakness of this firewall is that it allows attacks on those ports that are open through the firewall

It also cannot stop or detect the use of allowed ports above 1024 for evil purposes
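The rule list such a router consults can be sketched as data plus a first-match lookup (the rules restate the SMTP and telnet examples above; the default-deny policy is a simplification, and a real router also matches on source and destination addresses):

```python
# Each rule names a protocol, a destination port, and an action.
RULES = [
    {"proto": "tcp", "port": 25, "action": "allow"},  # SMTP mail in
    {"proto": "tcp", "port": 80, "action": "allow"},  # HTTP to the web server
    {"proto": "tcp", "port": 23, "action": "block"},  # no telnet
]

def check(proto, port, default="block"):
    """Return the action of the first matching rule, else the default."""
    for rule in RULES:
        if rule["proto"] == proto and rule["port"] == port:
            return rule["action"]
    return default

print(check("tcp", 25))    # -> allow
print(check("tcp", 23))    # -> block
print(check("udp", 5000))  # -> block (nothing matched: default deny)
```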


Proxy server firewall

The proxy server is a machine that is on a separate network (the DMZ) that has direct access to the net

It assumes that client machines should not have direct access to the net

It requires special configuration of the client machines so that they do not access the net directly

Clients make requests of the proxy server

The proxy server checks its list of rules and if the request is accepted, it retrieves the information and delivers it to the client

All external unwanted traffic is kept off the local network


The only data that passes through the firewall is that which is allowed by the system’s access rules

This system cannot be compromised without reconfiguring the proxy server

Proxy servers can also cache frequently requested information, speeding downloads

Problems

They do add a layer of management and administrative responsibility

Many new internet tools and protocols (streaming audio) cannot be supported by proxy servers


Stateful firewall systems

This is a newer, more secure firewall system that combines packet filtering and proxy servers

This requires no special configuration of the client machines

It can apply rules to allow or deny access

It analyzes network traffic that passes through because it understands the different protocols

This is the advance: recognition of types of traffic instead of just port #


They allow an unregistered network to be run behind the firewall

This conserves IP # for the domain

They also support “virtual private networks”

This is secure communication over the net (using encryption)

Remote users can gain encrypted access to the local net, allowing telework


Controlling access with “robot exclusion protocol”

Robots are automated programs that walk the web and catalog everything they find in a database.

These data might drive a search engine, develop statistics, or fill a private repository of links

In addition to building search engine databases:

Some search only for specific documents, creating focused collections for a user or detecting page theft and copyright violations

Some test URLs, ensuring that a site is not suffering from web-rot

Some conduct performance and availability tests to see if sites are online and measure how long it takes to download pages
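
A minimal availability-and-timing probe of this kind can be sketched in Python; the function name and return shape are our own, not from any particular tool:

```python
import time
import urllib.request

def probe(url, timeout=10):
    """One availability test: did the URL respond, and how long did
    the full download take?  Returns (ok, seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()                      # time the complete download
        return True, time.monotonic() - start
    except OSError:                          # URLError subclasses OSError
        return False, time.monotonic() - start
```

A real monitor would run this on a schedule and log the results, alerting when `ok` goes false or the timing trends upward.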


Problems with robots

Unchecked, robots saturate a site with requests, however innocent

Real visitors can't get in while a robot is consuming all the available bandwidth to a site

These are “unintentional denial of service attacks” because of the tremendous numbers of requests that can be generated automatically

In response to complaints, robot authors forced their robots to wait, often several minutes or more, between requests to a site
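
That fix amounts to rate-limiting the crawl loop; a sketch (the helper name and the 60-second default are illustrative, not from any particular robot):

```python
import time

def polite_crawl(urls, fetch, delay=60.0):
    """Fetch each URL in turn, pausing `delay` seconds between requests
    so the robot never monopolizes the site's bandwidth."""
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:        # no need to sleep after the last one
            time.sleep(delay)
    return results
```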


The next problem concerns access statistics

Logs show a site’s access count growing, but the vast majority of visitors are robots

Robots now include identifying strings (the User-Agent header) in the requests they send to the server

A third problem was that webmasters began seeing parts of their sites in search engine databases that they didn't want indexed

These included CGI scripts, images, prototype pages, and other private data.

In response, the “Robot Exclusion Protocol” [REP] was developed, allowing authors to specify exactly who is allowed to index what on any site


The Robot Exclusion Protocol

REP lets a Webmaster specify which robots, based upon their identifying strings, can access their site

For those robots allowed in, specific parts of the site can be made available for access

This requires a single file in the site which controls access for all robots

The file is named “robots.txt” and must be placed in the top-level document directory on the site

Well-behaved robots will read this file before visiting the site, and will only access the site if the file grants them access
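
Python's standard library includes urllib.robotparser, which implements exactly this check; here a sample rule set (hypothetical rules) is parsed inline rather than fetched from a live site:

```python
from urllib.robotparser import RobotFileParser

rules = """\
# Sample robots.txt (hypothetical rules)
User-agent: howard
Disallow: /cgi-bin
"""

rp = RobotFileParser()
# A live robot would call rp.set_url("http://site/robots.txt"); rp.read()
rp.parse(rules.splitlines())

print(rp.can_fetch("howard", "/cgi-bin/form.cgi"))    # False: blocked
print(rp.can_fetch("howard", "/index.html"))          # True: allowed
print(rp.can_fetch("otherbot", "/cgi-bin/form.cgi"))  # True: no rule for it
```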


Within this file three kinds of lines can be used

Comment lines

Any line beginning with a “#” character is a comment and is ignored by the robot

It makes sense to comment your entries in the file, if only for your later perusal

User agent lines

One or more user agent lines specify which robots are granted access and which are denied

Resource control lines

One or more resource control lines specify areas of the site to which access is blocked

Blank lines separate the sets but may not appear within them


Example of “robots.txt”

Suppose you've discovered that a spider named “Howard” has been visiting your CGI script directory

You restrict access with this robots.txt file:

# Sample robots.txt file

User-agent: howard

Disallow: /cgi-bin

Disallow: /~

The second “Disallow” line restricts access to the user directories, whose URLs begin with “/~” (REP paths must start with “/”)


You can use a wild card to

Force all robots to pay attention

User-agent: *

And to block off your entire site

Disallow: /

To block all robots from your entire site, use:

# Sample robots.txt file

User-agent: *

Disallow: /

To let a specific robot in while others are excluded, give it its own record with an empty “Disallow” line (the original REP has no “Allow” directive):

User-agent: howard
Disallow:


A second part of REP is the robot <meta> tag

If you cannot create or modify the “robots.txt” file on your site due to access restrictions, you can still control robots access to your pages

Any HTML page can use a <meta> tag to control how a robot indexes the page

For this <meta> tag, you supply a name attribute of robots

The content attribute can contain the values “noindex” and “nofollow”

To prevent a page from being included in an index, in the <HEAD> of the document, you would place

<meta name="robots" content="noindex">


This keeps the page from being added to the index but does not prevent the robot from parsing the page, extracting any URLs, and visiting those pages

To prevent the robot from traveling beyond this page, use the “nofollow” value for the content attribute

You can prevent both indexing and follow-on by combining the values:

<meta name="robots" content="noindex, nofollow">

The only disadvantage to using this tag is that few robots currently honor it


Resources

CIO Magazine (1998). “Webmaster: The Job.” http://www.cio.com/forums/wmf_the_job.html

Musciano, C. (1996). “Collecting and Using Server Statistics.” SunWorld Online. http://www.sun.com/sunworldonline/swol-03-1996/swol-03 webmaster.html

Robot Exclusion Protocol Draft Specification (1998). http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html

Sol, S. (1998). “What is a Webmaster.” http://WDVL.com/Internet/Web/Jobs/webmaster.html

Web Robots Page (1998). http://info.webcrawler.com/mak/projects/robots/robots.html

