Web Engineering Basic Technologies: Protocols and Web Servers

Husni@trunojoyo.ac.id

Basic Web Technologies

• HTTP and HTML

• Web Servers

• Proxy Servers

• Content Delivery Networks

Where we will be later in the course ...

• Supporting a range of client devices

World-Wide Web

• Series of Protocols
  • URL/URI unique identification of resources

• URI examples
  • http://www.inf.ethz.ch/education
  • mailto:john.smith@inf.ethz.ch
  • ftp://ftp.inf.ethz.ch/ed/report.txt
  • tel:+41-44-6321234
  • ....

• URL is a URI that provides information about how to locate a resource

• HTTP Hypertext Transfer Protocol

• HTML Hypertext Mark-Up Language

• Web Browsers
  • Internet Explorer, Mozilla Firefox …..

HTML

<html>
<head>
<title>Michael's Personal Home Page</title>
</head>
<body>
<h1> Michael </h1>
Michael works at <a href="http://www.ethz.ch"> ETH Zurich </a>
<h2>Personal</h2>
<address> CNB E106 <br/> Zurich <br/> Switzerland </address>
</body>
</html>

HTML …

• Based on Hypertext Style of Navigation

• Simple and Easy to Publish on Web

• Structure, Content or Presentation?
  • wide use of table elements for formatting layout
  • address elements describe the content

• Problems of
  • link maintenance
  • document interpretation

• Flexible
  • unknown tags ignored by browsers => easy to extend with customised tags

HTML ......

• Document meta data can be included in header

<head>

<title>Michael's Personal Home Page</title>

</head>

• Keywords used by search engines

HTML5 : The Next Generation of HTML

• New standard for HTML, XHTML and HTML DOM

• Work in progress, most browsers now have some support

• Cooperation between W3C and Web Hypertext Application Technology Working Group (WHATWG)

• One goal was to have a clearer separation of content and presentation
  • HTML5 - content
  • CSS3 - layout as well as look and feel

• Second goal to make it easier to process documents and their content

• Third goal to reduce the need for plug-ins

Key Features of HTML5

• Tags to support a stronger document model to make it easier to identify logical elements of documents
  • section, article, aside, details, header, footer …

• Support for other media types
  • video, audio …

• Take over some of the things normally handled by JavaScript such as form field validation
  • form field input types such as email, url, dates, numbers ….
  • far richer set of event attributes

• Support for client-side storage
  • replacing cookies

HTTP 1.0

• hypertext transfer protocol

• one object transferred per connection

HTTP request

GET /www/globis.html HTTP/1.0

Accept: www/source

Accept: text/html

Accept: image/gif

User-Agent: Lynx/2.4 libwww/2.14

HTTP result

HTTP/1.0 200 OK
Date: Thursday, 23-April-98 09:00:05 GMT
Server: NCSA/1.4.2
MIME-version: 1.0
Content-type: text/html
Content-length: 3500

(note the blank line between the header and the body of the message)
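The blank line is what lets a client find where the header ends and the body begins. A minimal Python sketch, assuming the response is already held as a string; the function name `split_response` and the sample body are illustrative and not part of the slides:

```python
def split_response(raw: str):
    """Split a raw HTTP response into status line, header dict and body."""
    # Header and body are separated by the first blank line (CRLF CRLF).
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


raw = (
    "HTTP/1.0 200 OK\r\n"
    "Server: NCSA/1.4.2\r\n"
    "Content-type: text/html\r\n"
    "\r\n"
    "<html>...</html>"   # made-up body for illustration
)
status, headers, body = split_response(raw)
print(status)                   # HTTP/1.0 200 OK
print(headers["content-type"])  # text/html
```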

HTML Form

• GET /cgi-bin/globis.pl?user=moira&pass=fred

HTML Form

<html>
<body>
<form action="/cgi-bin/globis.pl" method="get">
Name: <input type="text" name="user" size="10">
Password: <input type="password" name="pass" size="6">
</form>
</body>
</html>
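When such a form is submitted with GET, the browser encodes the field names and values into the query string of the request. A small Python sketch of that encoding, using the standard `urllib.parse.urlencode`; the helper name `form_get_request_line` is hypothetical:

```python
from urllib.parse import urlencode


def form_get_request_line(action: str, fields: dict) -> str:
    """Build the request line a browser would send for a GET form."""
    return f"GET {action}?{urlencode(fields)} HTTP/1.0"


line = form_get_request_line("/cgi-bin/globis.pl", {"user": "moira", "pass": "fred"})
print(line)  # GET /cgi-bin/globis.pl?user=moira&pass=fred HTTP/1.0
```

Note that the password travels in clear text inside the URL, which is one reason GET is a poor fit for login forms.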

Introducing Dynamic Content

• need to introduce some mechanism to execute programs on the server side and dynamically generate HTML documents

CGI Programming

• Common Gateway Interface

• Executes Programs on Server Side

CGI Result

CGI Programs

• Can be written in any language

• Desirable Features
  • ease of text manipulation
  • ability to access environment variables
  • ability to interface with other services

• Commonly Used Languages
  • Perl, C/C++, Tcl, Java

Accessing Form Data

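#!/usr/local/bin/perl
# Parse a query string such as "user=moira&pass=fred" into form fields.
$query_string = $ENV{'QUERY_STRING'};
foreach $pair (split(/&/, $query_string)) {
    ($field_name, $param) = split(/=/, $pair, 2);
    $form{$field_name} = $param;
}
if ($form{'user'} eq "moira") {
    # A redirect sends a Location header instead of a Content-type header.
    print "Location: /globis.html", "\n\n";
} else {
    print "Content-type: text/html", "\n\n";
    ……
}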
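The same idea can be sketched in Python: a CGI program reads the form data from the `QUERY_STRING` environment variable and writes a header block to standard output. This is an illustration only; the function name `cgi_response` is hypothetical, and the environment is passed in as a dict so the logic can be tried without a web server:

```python
from urllib.parse import parse_qs


def cgi_response(environ: dict) -> str:
    """Return the CGI header block for a request, based on QUERY_STRING."""
    # parse_qs maps each field name to a list of values.
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    user = fields.get("user", [""])[0]
    if user == "moira":
        # Redirect known users to the start page.
        return "Location: /globis.html\n\n"
    # Otherwise announce that an HTML document follows.
    return "Content-type: text/html\n\n"


print(repr(cgi_response({"QUERY_STRING": "user=moira&pass=fred"})))
print(repr(cgi_response({"QUERY_STRING": "user=alice"})))
```

In a real CGI script the dict would be `os.environ` and the result would be printed, followed by the document body in the non-redirect case.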

Unix Environment Variables

SERVER_NAME

REMOTE_HOST

REMOTE_ADDR

REMOTE_USER

REQUEST_METHOD

QUERY_STRING

.......

GET and POST

• Two methods for sending Form Data

• GET
  • appends form data to url

GET /cgi-bin/globis.pl?name=globis HTTP/1.0

• POST
  • form data read from standard input

POST /cgi-bin/globis.pl HTTP/1.0
....
user=moira&pass=fred
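The difference between the two methods can be seen by building both requests without sending them. A minimal sketch using Python's standard `urllib.request.Request`; the host `example.org` is a placeholder and not from the slides:

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = urlencode({"user": "moira", "pass": "fred"})
base = "http://example.org/cgi-bin/globis.pl"  # hypothetical host

# GET: the form data is appended to the URL.
get_req = Request(f"{base}?{fields}")

# POST: the form data travels in the request body,
# which a CGI program reads from standard input.
post_req = Request(base, data=fields.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```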

Server Side Includes

• Directives included in HTML Documents
  • execute programs
  • output data such as environment variables

Server Side Includes ...

<html>

<head><title>Globis</title></head>

<body>

<h1>Welcome to </h1>

......

<address>Moira()</address>

</body>

</html>

Server Side Includes ......

Configure Server to say

• documents which should be parsed
  • AddType text/x-server-parsed-html .shtml
  • AddType text/x-server-parsed-html .html

• directives supported
  • Includes - display environment variables etc.
  • Exec - execute External Programs
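A rough sketch of what a server does when it parses a document for include directives, assuming the Apache-style syntax `<!--#echo var="NAME" -->`. The page fragment, the variable value, and the function name `expand_echo` are made up for illustration; a real server supports more directives than this:

```python
import re

# Matches directives of the form <!--#echo var="NAME" -->
ECHO = re.compile(r'<!--#echo\s+var="([^"]+)"\s*-->')


def expand_echo(page: str, variables: dict) -> str:
    """Replace each echo directive with the value of the named variable."""
    return ECHO.sub(lambda m: variables.get(m.group(1), "(none)"), page)


page = '<h1>Welcome to <!--#echo var="SERVER_NAME" --></h1>'
expanded = expand_echo(page, {"SERVER_NAME": "www.example.org"})
print(expanded)  # <h1>Welcome to www.example.org</h1>
```

Because every parsed document costs the server extra work, restricting parsing to `.shtml` files is usually cheaper than parsing every `.html` file.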

Where to Cache?

• Caching can occur at many different levels and locations in web architectures

• Four fundamental ways for implementing a caching mechanism
  • browser caching
  • proxy caching
  • reverse proxy caching/server accelerators
  • content delivery networks (CDN)

• We will go on to look at each of these in turn

Browser Caching

• Every browser contains cache of HTML docs & multimedia files

• Browser cache is a directory in user’s hard disk

• Advantages
  • simple
  • universal

• Disadvantages
  • applies only to static resources
  • can be by-passed by content provider who can add suitable HTTP headers to response or directives to HTML page forcing browser not to cache

Proxy Caching

• A proxy cache lies between a community of users (e.g. D-INFK, ETHZ) and the public internet

• Works on same basic principles as browser cache, but on much larger scale (may be hundreds or thousands of users)

• Proxy caches sometimes implemented together with firewalls which control flow of requests/responses between intranet and internet

• Client requests have to somehow be routed to proxy server
  • can be done through browser’s proxy setting
  • interception proxies have requests redirected to them by underlying network

proxy server: cache miss

• http://some.host/path/doc.html

• http_proxy=http://www_proxy.my.domain

proxy server: cache hit
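The miss and hit cases can be sketched with a toy proxy that keeps a local store and only contacts the origin server when the requested URL is not in it. This is an illustration only: the class `ProxyCache` and the stand-in `fake_origin` are hypothetical, and freshness checks are left out:

```python
class ProxyCache:
    """Toy proxy: serves from a local store, else asks the origin."""

    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin  # callable(url) -> body
        self.store = {}

    def get(self, url):
        if url in self.store:
            return self.store[url], "hit"
        body = self.fetch_from_origin(url)  # cache miss: contact origin
        self.store[url] = body
        return body, "miss"


origin_calls = []


def fake_origin(url):
    """Stand-in for the origin server; records each request it receives."""
    origin_calls.append(url)
    return f"<html>content of {url}</html>"


proxy = ProxyCache(fake_origin)
url = "http://some.host/path/doc.html"
first = proxy.get(url)[1]
second = proxy.get(url)[1]
print(first, second, len(origin_calls))  # miss hit 1
```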

HTTP/1.0

• GET URL

• HEAD URL
  • HEAD is similar to GET but only asks for the response header rather than the content

HTTP/1.1 200 OK
Date: Wed, 10 May 2000 09:33:08 GMT
Server: Apache/1.3.12 (Win32)
Last-Modified: Mon, 01 May 2000 13:37:40 GMT
Content-type: text/html
Content-length: 907

• GET URL
If-Modified-Since: Sun, 05 Mar 2000 13:00:00 GMT
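A conditional GET lets the server answer "304 Not Modified" without resending the content when the cached copy is still current. A minimal Python sketch of the server-side comparison, using the standard `email.utils.parsedate_to_datetime` to parse HTTP dates; the function name `conditional_status` is hypothetical:

```python
from email.utils import parsedate_to_datetime


def conditional_status(last_modified: str, if_modified_since: str) -> int:
    """Return 304 if the resource is unchanged since the client's copy, else 200."""
    modified = parsedate_to_datetime(last_modified)
    since = parsedate_to_datetime(if_modified_since)
    return 304 if modified <= since else 200


# Resource changed on 1 May, client copy dates from 5 March: send it again.
print(conditional_status("Mon, 01 May 2000 13:37:40 GMT",
                         "Sun, 05 Mar 2000 13:00:00 GMT"))  # 200

# Client copy dates from 10 May, after the last change: not modified.
print(conditional_status("Mon, 01 May 2000 13:37:40 GMT",
                         "Wed, 10 May 2000 09:33:08 GMT"))  # 304
```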

Browser and proxy caches

• All caches have a set of rules used to determine what can be cached and when to use cached resources
  • some rules set in protocols
  • some set by cache software (e.g. browser)
  • some set by cache administrator

• Many of these rules based on information in the HTTP request/response header
  • added by server/browser
  • explicitly generated by content provider
  • may be based on type of request or type of content
  • example: HTTP header containing Cache-Control: no-cache

General caching rules

• If response’s header says not to keep it, it won’t be cached

• If no validator (e.g. a Last-Modified header) is present on a response, it will be considered uncacheable

• If request is authenticated or secure, it won’t be cached

• A cached object is considered fresh (i.e. able to be sent to client without checking with origin server) if
  • it has an expiry time or other age-controlling header set and is still fresh
  • if object already seen and browser cache set to check once a session
  • if proxy cache has seen object recently and long time since modified

• If a representation is stale, the server will be asked to validate it
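The first three rules can be written down as a small decision function. This is a simplified sketch, not how any particular cache is implemented; it assumes header names have been lower-cased, treats `no-store` as the "do not keep" signal, and the function name `is_cacheable` is hypothetical:

```python
def is_cacheable(headers: dict, authenticated: bool = False, secure: bool = False) -> bool:
    """Apply the first three general rules to a response's headers.

    `headers` maps lower-case header names to values.
    """
    cache_control = headers.get("cache-control", "").lower()
    if "no-store" in cache_control:
        return False  # the response says not to keep it
    if "last-modified" not in headers and "etag" not in headers:
        return False  # no validator present
    if authenticated or secure:
        return False  # authenticated or secure request
    return True


print(is_cacheable({"last-modified": "Mon, 01 May 2000 13:37:40 GMT"}))  # True
print(is_cacheable({"content-type": "text/html"}))                        # False
print(is_cacheable({"etag": '"abc"', "cache-control": "no-store"}))       # False
```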

HTTP header information for caching

• Example

HTTP validators and validation

• Validation used by servers and caches to communicate when an object has changed

• Most common validator is Last-Modified time
  • if cache has object with last-modified time t, generate If-Modified-Since t request to server to check if object still current

• HTTP 1.1 introduced ETags as another kind of validator
  • every time object changes, server generates a unique identifier ETag which is included in HTTP response header of object request
  • server controls how ETags generated

• Most modern web servers generate both ETags and Last-Modified validators automatically for static content
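ETag validation can be sketched as follows. Since the server controls how ETags are generated, the content hash used here is only one possible choice, and the names `make_etag` and `respond` are hypothetical. The cache sends its stored ETag back in an `If-None-Match` header and the server compares it with the current one:

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive an ETag from the content; any scheme that changes with the content works."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, etag, body); the body is omitted when the client's copy is current."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, etag, b""
    return 200, etag, body


# First request: no validator yet, full response with an ETag.
status, etag, _ = respond(b"<html>v1</html>")

# Revalidation while the object is unchanged, then after it has changed.
unchanged = respond(b"<html>v1</html>", etag)[0]
changed = respond(b"<html>v2</html>", etag)[0]
print(status, unchanged, changed)  # 200 304 200
```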

HTTP cache-control

• max-age=[seconds]

max amount of time page considered fresh; relative to time of request

• s-maxage=[seconds]

similar to max-age, except only applies to shared caches (e.g. proxies)

• public

marks authenticated responses as cacheable; normally if HTTP authentication required, responses uncacheable

• no-cache

forces cache to submit request to original server every time for validation before releasing cached copy

• no-store

instructs cache not to keep a copy under any circumstances

• must-revalidate

instructs cache to obey any freshness information given about an object; counteracts some conditions in which cache may serve stale representations

• proxy-revalidate

similar to must-revalidate, but only applies to proxy caches
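A minimal sketch of how a cache might read these directives and decide whether a stored copy is still fresh. It is deliberately simplified (for example, it ignores the Expires header and heuristic freshness), and the function names `parse_cache_control` and `is_fresh` are hypothetical:

```python
def parse_cache_control(value: str) -> dict:
    """Parse a Cache-Control header value into {directive: argument or True}."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"') or True
    return directives


def is_fresh(directives: dict, age_seconds: int, shared_cache: bool) -> bool:
    """Decide whether a copy of the given age may be served without validation."""
    if "no-cache" in directives or "no-store" in directives:
        return False
    limit = directives.get("max-age")
    # s-maxage takes the place of max-age, but only for shared caches.
    if shared_cache and "s-maxage" in directives:
        limit = directives["s-maxage"]
    return isinstance(limit, str) and age_seconds < int(limit)


cc = parse_cache_control("max-age=3600, s-maxage=60, must-revalidate")
print(is_fresh(cc, 120, shared_cache=False))  # True: browser cache, 120 < 3600
print(is_fresh(cc, 120, shared_cache=True))   # False: proxy cache, 120 >= 60
```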

What doesn’t work

• HTML meta tags
  • example:
    <meta http-equiv="expires" content="Thu, 26 May 2005 10:50:00 GMT">
    <meta http-equiv="pragma" content="no-cache">
  • easy to use, but are not very effective
  • HTML not usually read by proxy servers
  • few browsers honour such specifications

• Pragma HTTP headers
  • can include in HTTP header
    Pragma: no-cache
  • HTTP specification does not specify how this should be handled in a response, and many browsers ignore it

Problems of proxy servers

• Connections to servers still required

• Still a high server load

• Servers lose control over their documents
  • authorisation
  • billing
  • access statistics

Prefetching caches

• Request objects from the server without an explicit request

• Based on
  • access patterns
  • object analysis (HTML documents, ...)
  • explicit subscriptions

• Reduces latency

• If level of prefetching too high then may pay severe penalties in terms of
  • increased network traffic
  • server load
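Object analysis can be illustrated by scanning a fetched HTML document for links that a prefetching cache might request ahead of time. This is a sketch under simple assumptions: only `<a href>` links are considered, and a `limit` parameter caps the number of candidates to keep the extra traffic bounded. The names `LinkCollector` and `prefetch_candidates` are hypothetical:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href targets of anchor tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def prefetch_candidates(html: str, limit: int = 5) -> list:
    """Return at most `limit` linked URLs found in the document."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links[:limit]


page = 'Michael works at <a href="http://www.ethz.ch"> ETH Zurich </a>'
candidates = prefetch_candidates(page)
print(candidates)  # ['http://www.ethz.ch']
```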

Proxy servers

• advantages
  • reduce latency, network bandwidth and server load
  • opportunity to analyse an organisation's usage patterns
  • transparent to clients and servers

• disadvantages
  • additional resources
  • single point of failure
  • chance that users get stale data from the cache

Reverse proxy caching

• Reverse proxy caches are also intermediaries, but instead of being deployed by network administrators are deployed by the webmasters themselves (i.e. server side)

• Improve web site’s
  • performance
  • reliability
  • scalability

• Typically some form of load balancer used to make one or more gateway caches look like the origin server to clients

• Sometimes known as “gateway caches”, “surrogate caches”, or “server accelerators”

Content Delivery Networks (CDNs)

• A content delivery network distributes gateway caches throughout the Internet (or part of it) and sells caching to interested web sites
  • Akamai (http://www.akamai.com)

• Original idea:
  • when a client requests a page from the origin server, the server returns a page with rewritten links that point to a node of the CDN so that further client requests are handled by the CDN

• CDN serves requests using multiple cache nodes, selecting the optimal copy of the page given the geographical location of the user and the real-time network traffic conditions

Content Delivery Networks (CDNs) ...

• CDNs now perform dynamic request routing using the Internet's Domain Name System (DNS)

• DNS is a distributed directory mapping fully qualified domain names (FQDN) to IP addresses

• To determine an FQDN's address, a DNS client sends a request to its local DNS server which then queries a set of authoritative servers

• When local DNS receives a response, it sends address to DNS client and saves it in cache

• Each DNS record has a time-to-live (TTL) field that tells DNS server how long it may cache result

• Normally the association of FQDN to IP address is static
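The caching behaviour of a local DNS server can be sketched as a lookup table whose entries expire after their TTL. This is a toy model only: the class `DnsCache`, the stand-in `fake_authoritative`, the address (taken from the documentation range 192.0.2.0/24) and the 30-second TTL are all assumptions for illustration, and the clock is injectable so that expiry can be shown without waiting:

```python
import time


class DnsCache:
    """Toy local DNS server cache keyed by FQDN."""

    def __init__(self, resolve_authoritative, clock=time.monotonic):
        self.resolve_authoritative = resolve_authoritative  # callable(fqdn) -> (ip, ttl)
        self.clock = clock
        self.entries = {}  # fqdn -> (ip, expiry time)

    def lookup(self, fqdn):
        now = self.clock()
        cached = self.entries.get(fqdn)
        if cached and now < cached[1]:
            return cached[0]  # still within its TTL
        ip, ttl = self.resolve_authoritative(fqdn)
        self.entries[fqdn] = (ip, now + ttl)
        return ip


queries = []


def fake_authoritative(fqdn):
    """Stand-in for the authoritative servers; records each query."""
    queries.append(fqdn)
    return "192.0.2.10", 30  # address and TTL in seconds


now = [0.0]
cache = DnsCache(fake_authoritative, clock=lambda: now[0])
cache.lookup("www.example.org")  # queries the authoritative server
cache.lookup("www.example.org")  # answered from the cache
now[0] = 31.0                    # the TTL has expired
cache.lookup("www.example.org")  # queries again
print(len(queries))  # 2
```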

Content Delivery Networks (CDNs) ......

• CDNs use modified DNS servers for CDN server selection

• Results of a DNS query to one of these servers may vary depending on source of request and network condition

• To enable fast reaction to dynamic resource changes, the answer returned by the CDN's DNS server has a small TTL

• This approach is largely transparent to client and works for any web content

• Issues
  • usually assumed client close to their local DNS servers
  • single request from a local DNS server can represent differing number of web clients (hidden load factor)
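Server selection by a CDN's modified DNS server can be sketched as a function that looks at where the request comes from and how loaded each cache node is, then returns an address with a small TTL. Real CDNs use their own selection policies; the policy here (least-loaded node in the requester's region, else least-loaded overall), the node table, the addresses, and the 20-second TTL are all assumptions for illustration:

```python
def select_cdn_node(client_region: str, nodes: list) -> dict:
    """Pick the least-loaded node in the client's region, else least-loaded overall."""
    local = [n for n in nodes if n["region"] == client_region]
    candidates = local or nodes
    best = min(candidates, key=lambda n: n["load"])
    # A small TTL lets the answer change quickly when conditions change.
    return {"ip": best["ip"], "ttl": 20}


nodes = [
    {"ip": "192.0.2.1", "region": "eu", "load": 0.7},
    {"ip": "192.0.2.2", "region": "eu", "load": 0.2},
    {"ip": "198.51.100.1", "region": "us", "load": 0.1},
]
print(select_cdn_node("eu", nodes)["ip"])    # 192.0.2.2
print(select_cdn_node("asia", nodes)["ip"])  # 198.51.100.1
```

The region passed in is really that of the local DNS server that sent the query, which is why the approach assumes clients are close to their local DNS servers.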