Electronic Commerce
School of Library and Information Science
Web server administration
I. Being a webmaster
• Some tasks
• What do they really do?
II. Administering a web site
• Ecommerce infrastructure
• Analyzing server logs
III. Information security
• Security issues in ecommerce
• Technical security: SSL, Firewalls, REP
What tasks must be accomplished in the design, development, and management of a successful web site?
Webmasters have to manage a range of tasks
Content Creation
Architectural Design
Implementation
Visual Design
Management
Content creation
A key to a successful web site is finding the right person to provide meaningful, useful, and "well written" content
The information you present must be such that random web surfers will actually choose to return to your site because what you provide is helpful or entertaining
Selling or promoting should be a side effect of the real reason a potential customer is browsing your pages
Web site content design balances on a fine line between public service and marketing.
Every web site should have a web content developer
The content developer should be a great writer
This person should be granted editorial privilege over all web content
There should be standards and templates so that the "feel" of the content remains stable if the person quits or content development is outsourced
Web content must capture the “spirit” of the company, person, or topic
The best web sites are summations of companies or people or topics
Information architecture
When the content is in place, the next problem is how to present that content on the web.
A web or information architect is responsible for designing the work flow of the site
They will typically be good at meta-vision, flow charts, navigation templates
They will be regular web surfers who seek out and analyze new navigation metaphors and strategies constantly
The design will be developed and prototyped, tested, and modified
Architectural design questions:
How deep? how wide?
How would frames affect navigation?
When is a hierarchical data structure appropriate?
When is an information cloud more efficient?
How many pages must the average user navigate before she gets to the data she requires?
The web architect must have a good knowledge of site content, a sense about how the content is used, and an idea of how it all fits together
They should watch users navigate the site over time and think of new ways to organize the data to facilitate browsing and use
Implementation
With content and architecture defined, the next step is to make it all web accessible
This requires a set of HTML pages and a web server to distribute those pages
Web technicians and web site administrators make this happen
A web technician is the person responsible for changing content into HTML documents
A web site administrator is responsible for installing, maintaining, trouble shooting, and providing security for web server hardware and software
A good web technician can develop and document clear site-wide coding standards
Good code is the foundation of the web site
A good technician can write code that is standardized and easy to read, so that a newly hired technician could get up to speed in a week
This ensures continuity and that maintenance and modifications are smooth and cost efficient
The technician should also understand HTML and related content distribution technologies like CGI, Java, Real Audio/Video and Shockwave
This allows her to choose correctly between the many options for many different types of situations.
The web site administrator should know
UNIX or NT server administration,
TCP/IP
Traditional services like Telnet, Email and FTP
Web security issues
Attention to detail and a firm grasp of the technologies are essential because it is a well-known fact that computer systems carefully choose their TTF (time to fail)
The web site administrator should be able to understand the needs of non-technical users and content providers and be able to explain technical issues in plain talk
Visual Design
Sites need a visual designer who is responsible for logos, icons, navigation buttons, site-wide color standards, site-wide type face standards, side bars, menus, etc....
This person will be fluent in such applications as Adobe Photoshop, DeBabelizer, or Corel Draw as well as all the filters and tools for each
They will also be trained in the quirks and specifics of web graphics design as opposed to print graphics design
They should also be an avid web surfer who is always looking for new presentation tricks
Management
Most large sites create a position just for managing all the network resources.
A web site manager will make sure that communication lines are quick, efficient, and open
This involves opening lines of communications outside the department
This might mean working closely with the ad/marketing department
Web site managers are facilitators and should not rule the web with an iron fist
What is crucial is that the manager knows how to bring out the best of each member of the team and create the glue to bind each part to the whole
“No GUI tool will ever write well-designed and documented code. In fact, I recommend that for the next few years, all web technicians stick to simple text editors and learn how to write all their code by hand. This assures that when they do use GUI tools, they will be using those tools instead of being used by them.”
Selena Sol “What is a Webmaster” http://WDVL.com/Internet/Web/Jobs/webmaster.html
So what do they really do?
Nicole Collins ([email protected]) managed (2001) the company intranet
Her duties included:
1. Meeting with department contacts to continue development of their sites
2. Coding HTML
3. Working with graphic designers to develop “home page” graphics for each department that are also web friendly
4. Training identified content owners to use Web conversion tools, such as Word Internet Assistant, to convert their own documents to HTML
5. Creating graphics for lower level pages
6. Meeting with Intranet Steering Committee once a month to determine where the Intranet is going
7. Heading a monthly Web Developers group for the Internet Services group and division webmasters
8. Keeping up to date on web technologies
9. Working with Internet Services Group within IT to develop databases on the Intranet
10. Working with team to market the Intranet through promotional items
11. Delivering presentations to outside visitors, etc. about our company Intranet
12. Using Adobe Acrobat to deliver forms through the Intranet
13. Writing technical user guides, etc.
14. Keeping all departments up-to-date on what they can do on the web and where their departments should go to upload and manage their pages
Jason Hoch was program coordinator and web developer for MCNC MEMS Technology Applications Center http://mems.mcnc.org
“The real driver for the success of our Web site is not the HTML or graphic design but rather the creation of valuable content and the ease of finding it fast.
For example, we've added a ‘virtual cleanroom tour,’ downloadable design rules, QuickTime movies, FAQs and online order forms over the past few years.
As a result, we have loyal customers who can figure out our processes on their own using the above-mentioned tools.”
His responsibilities (2001) included:
1. Design/Maintain the MEMS Web site
2. Develop marketing-focused content
3. Create graphic design to support interactive content
4. Code HTML
5. Create online order forms using CGI
6. Convert design rules to Adobe PDF
7. Administer a Unix-based server
8. Interact with staff members/companies
9. Collaborate with engineers to get creative ideas for Web site
10. Interact with companies to contribute technical and creative input for MEMSTechNet
11. Create marketing and advertising brochures that promote the MCNC Web site
12. Help other groups with HTML
13. Update corporate Web site
14. Work with IS on intranet announcements, HR postings, etc.
15. Develop sites for an MCNC spinoff company and a worldwide technology conference
16. Maintain a password-protected site that supports workgroup-type applications with scientists and researchers at the National Academy of Sciences
Mark Polakow was "webmaster" for Telegroup International Inc., a half-billion-dollar telephone reseller
His responsibilities included:
1. Designing and maintaining Telegroup's Intranet:
Conferring with all the department heads to gather ideas on the content they want on the web
Implementing technologies to interact with databases and render information in HTML
Employee training/handbook
Customer records/invoices
Phone lists and other databases
2. Implement a security directory with password and digital encryption information for secured documents
3. Implement technologies such as conferencing and newsgroups
4. Train personnel in the construction and uploading of web pages, file transfer and use of necessary web apps
Train in image acquisition using scanners and imaging tools such as photoshop
5. Edit all final HTML submissions for visual congruity and proper coding
6. Run and maintain all Web servers
7. Keep current in all technologies and web developments
II. Administering a web site
• Ecommerce infrastructure
Server platforms
Different types of servers perform different functions
Physical server
A box (like ebiz)
Applications server
Software that resides on the physical server
Provides the “business logic” for relevant applications
Part of client-server architecture passing data from the client to the back end and product to the client
Applications server
They translate raw data from a database into information with meaning displayed on a browser
They handle load balancing by distributing the computational workload
They run parts of applications that users share and communicate between desktop and back-end systems
NetDynamics (Sun), Oracle WAS, Domino (Lotus)
Levinson, M. (2000). What is an Application Server? Darwin. http://www.darwinmag.com/learn/curve/column.html?ArticleID=7
Commerce server
Handles the functions involved in transactions
Product display, online ordering, inventory management
Works with online payment systems
Provides
Encryption: typically SSL
Integrity: the data do not change in transmission
Authentication: they know it's your site
Non-repudiation: you sent them what you thought you sent them
Database server
Handles DB management tasks
It monitors a port on the server and handles all incoming requests for the underlying database data
The requests are typically SQL
It processes the SQL statement(s) and takes some action based on the statement
This could be data retrieval, data insertion or deletion, or a security modification request from the database administrator (DBA)
Various. (1996). Java Developers’ Reference: What is a database server? http://sunsite.iisc.ernet.in/virlib/java/devref/ch25.htm#WhatIsaDatabaseServer
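The request-processing loop described above can be sketched with Python's built-in sqlite3 module standing in for the underlying database (sqlite3 is an embedded library rather than a networked server, so the port-monitoring step is omitted; the table and function names here are invented for illustration):

```python
import sqlite3

# In-memory database standing in for the server's underlying data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, price REAL)")

def handle_request(sql, params=()):
    """Process one incoming SQL statement, as a database server would,
    and return any rows the statement produces."""
    cur = conn.execute(sql, params)
    conn.commit()
    return cur.fetchall()

# Data insertion, retrieval, and deletion -- the request types named above.
handle_request("INSERT INTO products VALUES (?, ?)", ("A100", 19.95))
handle_request("INSERT INTO products VALUES (?, ?)", ("B200", 4.50))
rows = handle_request("SELECT sku, price FROM products ORDER BY sku")
handle_request("DELETE FROM products WHERE sku = ?", ("B200",))
```

A networked database server wraps exactly this kind of loop behind a TCP port and adds authentication and concurrency control.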
Disk server
This is a box with one or more hard disks that can be accessed by one or more workstations
Can be used for redundancy and load sharing
Fax server
Manages fax traffic into and out of the network
Allows faxing from the workstation
File server
Allows remote access to shared files and applications
Mail server
Sends, receives, stores, and manages email using SMTP, POP (Post Office Protocol), and MIME
Proxy server
Software that is an intermediary between a workstation and the net
Provides security and admin control
Part of a gateway and/or firewall server
Caches previously viewed pages
Web server
Stores, retrieves and transmits files and software when requested through HTTP
Can create and send dynamic pages in response to user input (using CGI to pass data)
Generates server logs
• Collecting and using server statistics
If you run a server, you need to know who is accessing your pages
Web servers generate logging information in the logs directory
access_log This log tells you who’s been there and where they went
agent_log What type of browser they used
referrer_log Who is linking to your pages
error_log What went wrong
Every access is logged with
Client address
Time and date of access
Documents requested
Number of bytes transferred
Errors (server and script)
These data can be analyzed by hand or with log analysis tools
An entry in the access_log looks like this:
host rfc931 username [date/time] request status bytes
host DNS name or IP # of remote client
rfc931 identd-provided information about the user [“-” if none]
username UserID sent by client [“-” if none]
date/time Date/time of access in 24hr. local time
request URL requested (with filepath) surrounded in “ ”
status The status code of the server’s response*
bytes Number of bytes transferred ["-" if none]
An entry in the error_log looks like this:
[date/time] An error message of some sort
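The access_log format above can be pulled apart programmatically. A minimal sketch in Python, assuming the Common Log Format fields listed on this slide (the regex and function name are illustrative, not part of any server's toolkit):

```python
import re

# Common Log Format: host rfc931 username [date/time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfc931>\S+) (?P<username>\S+) '
    r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_entry(line):
    """Split one access_log line into its named fields; None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

entry = parse_entry(
    '192.0.2.7 - - [19/Jan/2016:10:32:04 -0500] '
    '"GET /index.html HTTP/1.0" 200 5120'
)
```

Once each line is a dictionary of fields, all the analyses discussed below (visit counts, paths, status tallies) become simple aggregations.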
Those pesky status codes
*status The status code of the server’s response
200 The request was successfully filled
302 URL redirected to another document
400 Bad request made by the client
403 Access to this document is forbidden
404 Document not found
500 Internal server error
501 Request method not implemented (e.g., GET or POST)
503 Server out of resources
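A quick way to see how often each status code turns up is to tally the second-to-last field of each access_log line. A sketch, assuming Common Log Format entries (the sample lines are invented for illustration):

```python
from collections import Counter

def status_counts(log_lines):
    """Tally the status code field (second-to-last) of each access_log line."""
    return Counter(line.split()[-2] for line in log_lines)

sample = [
    '192.0.2.7 - - [19/Jan/2016:10:32:04 -0500] "GET / HTTP/1.0" 200 5120',
    '192.0.2.8 - - [19/Jan/2016:10:33:10 -0500] "GET /old.html HTTP/1.0" 404 210',
    '192.0.2.9 - - [19/Jan/2016:10:34:55 -0500] "GET / HTTP/1.0" 200 5120',
]
counts = status_counts(sample)
```

A sudden spike in 404s or 500s in such a tally is often the first sign that something on the site is broken.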
Using these logs
You can get a reasonable sense of the number of visitors from the access_log
Place all images in an /images directory
Use a UNIX command to remove them from the access_log file:
egrep -v '/images/' access_log | wc -l
This will leave you with the number of times your documents have been downloaded
You can gather data on the aggregate demand for individual documents on the site
It is possible to trace the individual’s path through the site since the access_log is chronological
How do they get in?
Which pages are the entry pages?
Where do they go?
How long do they stay?
Where do they not go?
How do they get there?
What navigation paths do they take?
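A rough way to answer these questions is to group requested documents by client host in log order. A sketch, with the caveat that visitors behind a shared proxy address will be merged into one "path":

```python
from collections import defaultdict

def paths_by_visitor(entries):
    """Group requested documents by client host, preserving log order,
    to reconstruct each visitor's navigation path."""
    paths = defaultdict(list)
    for host, document in entries:
        paths[host].append(document)
    return dict(paths)

# (host, document) pairs in the chronological order they were logged.
log = [
    ("192.0.2.7", "/index.html"),
    ("192.0.2.8", "/products.html"),
    ("192.0.2.7", "/products.html"),
    ("192.0.2.7", "/order.html"),
]
paths = paths_by_visitor(log)
```

The first document in each path is the entry page; the last is where the visitor left.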
Using the agent_log
This log tells you the type of browser the visitor was using
When you go to a site, your client passes information about itself to the server
A typical entry looks like:
Mozilla/3.1 (Windows;I;32bit)
A more complex set of entries might look like this:
Mozilla/3.1N (X11; I; SunOS 4.1.3_U1 sun4m)
Mozilla/3.1N (X11; I; AIX 2)
Mozilla/3.1N (Macintosh; I; PPC)
Mozilla/3.1N (Macintosh; I; 68K)
Mozilla/3.1N (X11; I; IRIX 5.2 IP22)
These are all versions of the same browser
To make things more confusing, if a request passes through a proxy server, the proxy server adds its identification on to the end of the version string
Here are some more Netscape 7.1 accesses: Mozilla/7.1 (Windows; I; 32bit) via proxy gateway CERN-HTTPD/3.0 libwww/2.17
Mozilla/7.1 (Windows; I; 32bit) via proxy gateway CERN-HTTPD/3.0 libwww/2.17 via proxy gateway CERN-HTTPD/3.0 libwww/2.17
These entries will vary widely for the thirty or so different browsers that are currently in use
It is possible to aggregate this list and sort by browser
For details see:
http://members.aol.com/htmlguru/agent_log.html
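One way to aggregate the agent_log by browser is to keep only the product/version token that precedes the parenthesized platform details. A sketch, with sample entries modeled on those above:

```python
from collections import Counter

def browser_counts(agent_lines):
    """Aggregate agent_log entries by browser name/version, ignoring the
    platform details in parentheses and any trailing proxy identification."""
    return Counter(line.split()[0] for line in agent_lines if line.strip())

agents = [
    "Mozilla/3.1 (Windows;I;32bit)",
    "Mozilla/3.1N (X11; I; SunOS 4.1.3_U1 sun4m)",
    "Mozilla/3.1N (Macintosh; I; PPC)",
    "Lynx/2.8.9 libwww-FM/2.14",
]
counts = browser_counts(agents)
```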
Using the referrer_log
This log tells you where they were when they clicked through to your page
This is useful information because it helps you figure out how many sites have linked to you
The data includes the URL of the page currently displayed by the browser when it connected to your site
This URL, called the "referring page," gets written to the referrer_log along with the document requested from your site
Entries in the referrer_log look like this:
http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html
http://webcrawler.com/cgi-bin/WebQuery -> /images.html
file:///Hard%20Drive/System%20Folder/Preferences/Netscape/Bookmarks.html
file://localhost/usr/users/la/lav6/public_html/.lynx_bookmarks.html -> /transparent_images.html
file:///I|/HTML/Referenc.htm -> /about_html.html
The URL to the left is the referring page
The path to the right is the document requested while that page was being viewed
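Tallying the referrer_log by referring host gives a quick picture of who links to you. A sketch, assuming entries in the "referring-URL -> document" form shown above (the function name is invented for illustration):

```python
from urllib.parse import urlparse
from collections import Counter

def referring_sites(referrer_lines):
    """Count referring hosts from referrer_log entries of the form
    'referring-URL -> /requested/document'."""
    hosts = Counter()
    for line in referrer_lines:
        referrer = line.split(" -> ")[0].strip()
        # file:// referrers (bookmarks, local caches) have no host;
        # fall back to the scheme so they are tallied as a group.
        host = urlparse(referrer).netloc or urlparse(referrer).scheme
        hosts[host] += 1
    return hosts

log = [
    "http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html",
    "http://webcrawler.com/cgi-bin/WebQuery -> /images.html",
    "http://sunsite.unc.edu/other.html -> /about_html.html",
    "file:///I|/HTML/Referenc.htm -> /about_html.html",
]
counts = referring_sites(log)
```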
How do they get to you?
Multiple accesses from the same referring page (around five or more in a monthly log) indicate a link to your site
Fewer than five accesses may indicate that the visitor used a bookmark
They may have also typed in your URL from another source
They found you with a query if cgi-bin and a search engine URL are in the entry
They used a locally cached page to get to you if you find file://localhost/usr/users or file://Hard%20Drive/ in the entry
If you don’t find many of your own pages in the log, people are showing up and leaving without looking around
III. Information security (infosec)
Digital information can easily be compromised
Infosec comprises the techniques and procedures used to protect information and prevent its unauthorized use
Attacks may involve interception, disclosure, alteration, substitution, or destruction of data
Ex: attack through “network sniffing” and filtering devices
These monitor the packets that move through the net
This allows authentication packets to be intercepted
If "root" access is sniffed out, the bad guys are in
Even if lightly encrypted, this is not a challenge for a sophisticated hacker
Major security issues in ecommerce
Internal network security (75% of attacks are internal)
Continued external hacking
Social engineering attacks (information warfare)
Malicious code (in applets etc)
Reliability and performance problems
Denial of service attacks (brute force attacks)
Lack of skills to properly implement and maintain security systems
IS services
Confidentiality: conceal data from unauthorized parties
Integrity: assure that the data is genuine
Authentication: user is the user
Data origin authentication: the data source is the data source
Data integrity: The data is the data
Non-repudiation: the transaction occurred the way all parties thought it did (re: participants)
Availability: efficient functioning after security measures are in place
How do these services play a role in ecommerce?
Protection is needed for:
Credit card and personal information transactions
Conditions of security:
The information must be inaccessible to unauthorized parties
It cannot be altered during transmission
The receiver must be able to know it came from the sender
The sender has to know that the receiver is genuine
The sender cannot deny that she sent it
It must be protected when on the server
It’s also needed for:
Virtual private networks
A secure channel (tunnel) is set up on the public network
It allows two systems to use EDI through the tunnel
A high volume of data is exchanged and the parties at either end are well known to each other
Proprietary methods of encryption and authentication can be used
These networks are currently vulnerable to denial of service attacks
One problem is the instability of the network
And for:
Digital certification
This involves the use of “trusted third parties” who will hold and verify digital certificates
They will authenticate users
They will also vouch for the integrity of data
Technical security: Secure Sockets Layer (SSL)
SSL provides a relatively secure means to encrypt and send data over a public network
SSL 3.0 has been around since March 1996
Netscape submitted it to W3C as a standard
It is an open and non-proprietary standard
It is supported by major server companies (Netscape, Microsoft, Apache)
SSL offers the core components needed to transmit sensitive data securely and to the appropriate person
It offers authentication at both the client and server sides, encryption, and integrity.
SSL uses both public and secret key cryptography for authentication
Authentication begins when a client requests a connection to an SSL server
The client sends its public key to the server, which generates a random message and sends it back to the client
The client uses its private key to encrypt (sign) the message from the server and sends it back
The server decrypts the message with the client's public key and compares it to the original one sent to the client
If the messages match, then the server knows that it’s talking to the correct client
Once the client has been authenticated, the server sends out the all-important session key
This is used to encrypt and decrypt all communications between the two machines for the duration of the session
Many secret key algorithms can be used for the session key (Data Encryption Standard (DES), RSA's RC4, or the IDEA algorithm)
Most browsers support at least 40-bit RC4 encryption
Some (including Navigator 4.x and Internet Explorer 4.x) can support DES and up to 128-bit RC4.
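To illustrate what a secret-key session cipher like RC4 does, here is a toy implementation in Python. RC4 has long since been broken and must not be used to protect real traffic; this is purely a teaching sketch of how one keystream both encrypts and decrypts:

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Toy RC4 stream cipher: the same call encrypts and decrypts.
    Illustrative only -- RC4 is broken and unsafe for real use."""
    # Key-scheduling algorithm (KSA): permute 0..255 under the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation (PRGA): XOR the keystream with the data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

session_key = b"\x01\x02\x03\x04\x05"      # a 5-byte (40-bit) key, as in export-grade RC4
ciphertext = rc4(session_key, b"order: 3 widgets")
plaintext = rc4(session_key, ciphertext)   # applying the same keystream again reverses it
```

Because encryption is just an XOR with a key-derived stream, both ends of the SSL session can run the identical routine once they share the session key.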
SSL can also be used to confirm the integrity of a message
SSL does this by using an MD5 message authentication code (MAC) scheme.
Using the MAC, the server can compare the message digest with its own digest of the message sent
If the two message digests are the same, then the stream has not been tampered with
Otherwise, the server can notify the client that its data stream has been corrupted and request that the client resend the data
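The digest-comparison step can be illustrated with HMAC-MD5 from the Python standard library. One hedge: SSL 3.0 actually used its own MD5/SHA-based MAC construction; HMAC is shown here as the standardized close relative, and the key and message are invented:

```python
import hmac
import hashlib

def mac_for(session_key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-MD5 message authentication code over the message."""
    return hmac.new(session_key, message, hashlib.md5).digest()

def verify(session_key: bytes, message: bytes, tag: bytes) -> bool:
    """Receiver recomputes the digest and compares in constant time."""
    return hmac.compare_digest(mac_for(session_key, message), tag)

key = b"shared-session-key"
message = b"charge card ending 1111, amount 19.95"
tag = mac_for(key, message)   # sent along with the message
```

If even one byte of the message is altered in transit, the recomputed digest no longer matches and the receiver can request a resend.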
SSL is used in conjunction with browsers and commerce servers to provide secure credit card transactions
An icon (and blue bar) lets you know when you are interacting with a secure server
It slows the transaction down because of encryption, decryption, and authentication
The security of the data is ensured when it moves from the client to the server
Implementing SSL in your Web server is a relatively easy task
The basic steps include
Generating a key pair on your server
Requesting a certificate from a certification authority
Installing the certificate, and
Activating SSL on a security folder or directory
It’s not a good idea to activate SSL on all your directories because the encryption overhead it adds can significantly decrease your response times.
Security on a commerce server
A server certificate lets the server prove its identity and exchange encrypted information with browsers
It contains a unique "distinguished name" that identifies the server to visitors
It contains the server's public key
It carries a Certification Authority's digital signature and is validated by the CA
It is installed in the web server software
The digital signature proves the identity of the signer and verifies the contents of the document
The CA digitally signs the distinguished name and public key portions of the certificate and anyone trusting the CA knows that the certificate has not been changed
Public key cryptography is used to exchange bulk encryption keys and verify digital signatures
The server has public and private keys
The public key allows secure communication with the server
The server certificate contains the public key
It is sent to a browser that wants information from the server
The browser uses the public key to encrypt a password that is then used to encrypt the rest of the communication
Only the server's private key can decrypt something encrypted with its public key
So the server is the only one that can successfully decrypt the password
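The "only the private key can decrypt" property can be demonstrated with textbook RSA on toy-sized primes. Numbers this small are trivially breakable, and real SSL uses padded keys hundreds of digits long, but the modular arithmetic is the same:

```python
# Toy RSA with tiny primes -- for illustration only; real keys are
# enormously larger and use padding schemes.
p, q = 61, 53
n = p * q                            # modulus, part of the public key
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (kept secret)

def encrypt_with_public_key(m: int) -> int:
    """What the browser does to the session password."""
    return pow(m, e, n)

def decrypt_with_private_key(c: int) -> int:
    """What only the holder of d -- the server -- can do."""
    return pow(c, d, n)

password = 1234                      # must be < n in this toy version
envelope = encrypt_with_public_key(password)
recovered = decrypt_with_private_key(envelope)
```

Anyone can run the encryption step using the public key from the certificate; only the server, holding d, can reverse it.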
Firewalls
The purpose of a firewall is to protect critical digital data from outside attack
It also allows legitimate users internet access
The best firewall is a standalone web server
With the move to link the server to corporate databases, this is no longer feasible
Types of firewalls
Packet filtering
Proxy server
Stateful firewall system
Background
IP (Internet protocol) has two functions
To deliver packets
To fragment and reassemble packets
IP has a “protocol type identifier” that allows other protocols to run on top
TCP (IP type 6) runs on top and handles error checking and resending
This slows traffic because it takes time for TCP to work
Other services have their own TCP ports
Telnet --> port 23, SMTP --> port 25, HTTP --> port 80
Clients access the server through a "high" port (>1024) on their own machine
UDP (IP type 17) is “user datagram protocol”
This is a faster protocol, but at the expense of error checking
It is used for streaming audio and video, which can be “lossy”
Packet filtering firewall
This is located in a router on the external border of a network
It is the simplest to implement
The router checks a list of rules when it receives an internal or external request
The rules allow and restrict access based on source, destination, and type (IP protocol type, TCP, UDP port #)
Ex: Allow all SMTP email to the server --> any IP address can send email to TCP port 25
Ex: Block telnet requests to the server --> all requests to TCP port 23 are blocked
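The first-match rule lookup a packet-filtering router performs can be modeled in a few lines of Python. Real routers use their own rule languages; this sketch only captures the matching logic, with a rule list mirroring the examples above:

```python
# A rule is (action, protocol, destination port); None matches any port.
# The first matching rule wins, as in a router's rule list.
RULES = [
    ("allow", "tcp", 25),    # any address may deliver SMTP mail
    ("block", "tcp", 23),    # all telnet requests are refused
    ("allow", "tcp", 80),    # web traffic to the server
    ("block", "tcp", None),  # default: drop everything else
]

def filter_packet(protocol: str, dest_port: int) -> str:
    """Check an incoming packet against the rule list, as the border
    router would, returning the action of the first matching rule."""
    for action, proto, port in RULES:
        if proto == protocol and (port is None or port == dest_port):
            return action
    return "block"   # fail closed if no rule matches at all

decision_smtp = filter_packet("tcp", 25)
decision_telnet = filter_packet("tcp", 23)
```

Note what the model makes plain: the filter sees only protocol and port, so it cannot judge what the traffic on an allowed port actually contains.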
Typical rules would involve allowing traffic to all ports below 1024 used by server services and blocking access to all the rest
Also allowing access to ports above 1024 unless they are used by an internal service (e.g., Microsoft's SQL Server uses port 1433)
The weakness of this firewall is that it allows attacks on those ports which allow access through the firewall
It also cannot stop or detect the use of allowed ports above 1024 for evil purposes
Proxy server firewall
The proxy server is a machine that is on a separate network (the DMZ) that has direct access to the net
It assumes that client machines should not have direct access to the net
It requires special configuration of the client machines so that they do not access the net directly
Clients make requests of the proxy server
The proxy server checks its list of rules and if the request is accepted, it retrieves the information and delivers it to the client
All external unwanted traffic is kept off the local network
The only data that passes through the firewall is that which is allowed by the system’s access rules
This system cannot be compromised without reconfiguring the proxy server
Proxy servers can also cache frequently requested information speeding downloads
Problems
They do add a layer of management and administrative responsibility
Many new internet tools and protocols (streaming audio) cannot be supported by proxy servers
Stateful firewall systems
This is a newer, more secure firewall system which combines packet filtering and proxy servers
This requires no special configuration of the client machines
It can apply rules to allow or deny access
It analyzes network traffic that passes through because it understands the different protocols
This is the advance: recognition of types of traffic instead of just port #
They allow an unregistered network to be run behind the firewall
This conserves IP addresses for the domain
They also support “virtual private networks”
This is secure communication over the net (using encryption)
Remote users can gain encrypted access to the local net, allowing telework
Controlling access with “robot exclusion protocol”
Robots are automated programs that walk the web and catalog everything they find in a database
These data might drive a search engine, develop statistics, or fill a private repository of links
In addition to building search engine databases
Some search only for specific documents to create focused collections for a user, or detect page theft and copyright violations
Some test URLs, ensuring that a site is not suffering from web-rot
Some conduct performance and availability tests to see if sites are online and measure how long it takes to download pages
Problems with robots
Unchecked, robots saturate a site with requests, however innocent
Real visitors can't get in while a robot is consuming all the available bandwidth to a site
These are “unintentional denial of service attacks” because of the tremendous numbers of requests that can be generated automatically
In response to complaints, robot authors forced their robots to wait, often several minutes or more, between requests to a site
The next problem concerns access statistics
Logs may show a site's access count growing while the vast majority of the visitors are actually robots
In response, robots now include identifying strings as part of the HTTP headers they send to the server
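Those identifying strings let a webmaster separate robot hits from human hits when computing statistics. A minimal sketch (the signature substrings are assumptions, not a standard list):

```python
# Assumed substrings that commonly appear in robot User-Agent strings.
ROBOT_SIGNATURES = ("bot", "crawler", "spider")

def is_robot(user_agent: str) -> bool:
    """Heuristically flag a hit as automated from its User-Agent header."""
    ua = user_agent.lower()
    return any(sig in ua for sig in ROBOT_SIGNATURES)

def human_hits(log_entries):
    """Keep only (path, user_agent) entries that do not look automated."""
    return [(path, ua) for path, ua in log_entries if not is_robot(ua)]
```

Counting only the filtered entries gives access statistics closer to the number of real visitors.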
A third problem was that webmasters began seeing parts of their sites in search engine databases that they didn't want indexed
These included CGI scripts, images, prototype pages, and other private data.
In response, the “Robot Exclusion Protocol” [REP] was developed, allowing authors to specify exactly who is allowed to index what on any site
The Robot Exclusion Protocol
REP lets a webmaster specify which robots, based upon their identifying strings, can access the site
For those robots allowed in, specific parts of the site can be made available for access
This requires a single file on the site that controls access for all robots
The file is named “robots.txt” and must be placed in the top-level document directory on the site
Well-behaved robots will read this file before visiting the site, and will only access the site if the file grants them access
Within this file three kinds of lines can be used
Comment lines
Any line beginning with a “#” character is a comment and is ignored by the robot
It makes sense to comment your entries in the file, if only for your later perusal
User agent lines
One or more user agent lines specify which robots are allowed access and which are denied
Resource control lines
One or more resource control lines specify areas of the site to which access is blocked
Blank lines separate the sets but may not appear within a set
Example of “robots.txt”
Suppose you've discovered that a spider named “Howard” has been visiting your CGI script directory
You restrict access with this robots.txt file:
# Sample robots.txt file
User-agent: howard
Disallow: /cgi-bin
Disallow: /~
The second “Disallow” line restricts access to all user directories, whose URLs begin with “/~”
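You can check how a well-behaved robot would read a file like this with Python's standard `urllib.robotparser` module (here the sample rules are fed in directly rather than fetched from a site, and the user directory rule is written as the path prefix “/~”):

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt, supplied as lines instead of fetched over the net.
rules = """\
# Sample robots.txt file
User-agent: howard
Disallow: /cgi-bin
Disallow: /~
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("howard", "/cgi-bin/search"))       # False: blocked
print(rp.can_fetch("howard", "/index.html"))           # True: allowed
print(rp.can_fetch("SomeOtherBot", "/cgi-bin/search")) # True: no rule names it
```

Note that the restrictions apply only to the robot named in the user agent line; all other robots remain unrestricted.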
You can use a wild card in a user agent line to force all robots to pay attention:
User-agent: *
A single “/” in a resource control line blocks off your entire site:
Disallow: /
To block all robots from your entire site, use:
# Sample robots.txt file
User-agent: *
Disallow: /
You can also admit a specific robot by giving it its own record with an empty “Disallow” line, which disallows nothing:
User-agent: Howard
Disallow:
A second part of REP is the robot <meta> tag
If you cannot create or modify the “robots.txt” file on your site due to access restrictions, you can still control robot access to your pages
Any HTML page can use a <meta> tag to control how a robot indexes the page
For this <meta> tag, you supply a name attribute of robots
The content attribute can contain the values “noindex” and “nofollow”
To prevent a page from being included in an index, in the <HEAD> of the document, you would place
<meta name="robots" content="noindex">
This keeps the page from being added to the index but does not prevent the robot from parsing the page, extracting any URLs, and visiting those pages
To prevent the robot from traveling beyond this page, use the “nofollow” value for the content attribute
You can prevent both indexing and follow-on by combining the values:
<meta name="robots" content="noindex, nofollow">
The only disadvantage to using this tag is that few robots currently honor it
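A robot that does honor the tag might check it as follows, sketched with Python's standard `html.parser` module (the class name is illustrative):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Honor the robots <meta> tag: decide whether to index this page
    and whether to follow the links it contains."""
    def __init__(self):
        super().__init__()
        self.index = True   # default: page may be indexed
        self.follow = True  # default: links may be followed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content", "").lower()
            if "noindex" in content:
                self.index = False
            if "nofollow" in content:
                self.follow = False

parser = RobotsMetaParser()
parser.feed('<head><meta name="robots" content="noindex, nofollow"></head>')
```

After feeding the page, the robot consults `parser.index` before adding the page to its database and `parser.follow` before extracting its URLs.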
Resources
CIO Magazine (1998). “Webmaster: The Job.” http://www.cio.com/forums/wmf_the_job.html
Musciano, C. (1996). “Collecting and Using Server Statistics.” SunWorld Online. http://www.sun.com/sunworldonline/swol-03-1996/swol-03 webmaster.html
“Robot Exclusion Protocol Draft Specification” (1998). http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html
Sol, S. (1998). “What is a Webmaster.” http://WDVL.com/Internet/Web/Jobs/webmaster.html
“Web Robots Page” (1998). http://info.webcrawler.com/mak/projects/robots/robots.html