
HTML and Web Pages Transparency No. 1 HTML and Web Pages Cheng-Chia Chen.

Date post: 13-Dec-2015

HTML and Web Pages

Transparency No. 1

HTML and Web Pages

Cheng-Chia Chen

Transparency No. 2

Outline

- Hypertext, markup languages, and HTML
- URLs and related schemes
- Brief introduction to HTML
- Brief introduction to CSS
- Limitations of HTML
- Unicode
- The World Wide Web Consortium (W3C)

Transparency No. 3

HTML

What is HTML?
- A textual language for describing web pages.
- An acronym for HyperText Markup Language.
- A combination of two concepts: hypertext and markup language.

Hypertext: collections of documents connected by hyperlinks. The concept predates the WWW by many years (see textbook):
- Paul Otlet: original philosophical treatise (1934)
- Vannevar Bush: hypothetical Memex system, an automated desk in which documents are stored on microfilm and referenced by code numbers (1945); the first hypertext mechanism
- Ted Nelson: coined the term 'hypertext' (mid-1960s); the first working systems followed by 1968

Hypermedia: generalizes hypertext beyond text.

Transparency No. 4

Markup Languages

Markup languages are languages that allow you to add markup to the data of a document.
Markup is text (control codes, tags, etc.) that is added to the data of a document in order to convey additional information about it. It can be used to make the formal structure of a text explicit.

History before XML (1998):
- Charles Goldfarb, Ed Mosher and Ray Lorie invent GML at IBM (1969) and implement the INTIME system (1970)
- This work culminated in the Standard Generalized Markup Language, SGML (1986)

Example (DTD, element, attribute, tag):

    <!DOCTYPE greeting [
      <!ELEMENT greeting (#PCDATA)>
      <!ATTLIST greeting style (big|small) "small">
    ]>
    <greeting style="big">hello world!</greeting>

Transparency No. 5

The Origins of the WWW

- The WWW was invented by Tim Berners-Lee at CERN (1989)
- Hypertext across the Internet (replacing FTP)
- Three constituents: HTML + URL + HTTP
  - HTML for content representation: an SGML language for hypertext
  - URL for addressing: a notation for locating files on servers
  - HTTP for request/response content transfer: a high-level protocol for file transfers

Transparency No. 6

The Design of HTML

- Simple, purist design principles
- HTML describes the logical structure of a document
- Browsers are free to interpret tags differently
- HTML is a lightweight file format

Size of a file containing just "Hello World!":

PostScript   11,274 bytes
PDF           4,915 bytes
MS Word      19,456 bytes
HTML             28 bytes

Transparency No. 7

The History of HTML

- 1992: HTML 1.0, Tim Berners-Lee's original proposal
- 1993: HTML+, some physical layout
- 1994: HTML 2.0, standard with best features
- 1995: Non-standard Netscape features
- 1996: Competing Netscape and Explorer features
- 1996: HTML 3.2, the Browser Wars end
- 1997: HTML 4.0, stylesheets are introduced
- 1999: HTML 4.01, slight modification to 4.0
- 2000: XHTML 1.0, an XML version of HTML 4.01
- 2001: XHTML 1.1, modularization
- 2002: XHTML 2.0, simplified and generalized
- 2010: HTML 5, still a working draft
- and more ...

Transparency No. 8

Uniform Resource Locator (URL)

References:
- spec: http://www.ietf.org/rfc/rfc1738.txt
- wiki: http://en.wikipedia.org/wiki/URL

A Web resource is located by a URL, for example:

    http://www.w3.org:80/TR/html4/ ...

Here http is the scheme, www.w3.org:80 is the authority (server), and /TR/html4/ is the path.

General form: URL = [scheme:][//authority][path][?query][#fragment]

- Relative URL and query string: employee/emp?id=12&name=jack
- Fragment identifier (reference; ref): http://www.w3.org/TR/HTML4/#minitoc
- Java API for URLs: java.net.URL
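
As a rough illustration of the general form above, the following sketch splits a URL into its parts and resolves a relative URL against a base, using java.net.URI. The class name UrlParts, its method names, and the combined example URL (query plus fragment on the W3C address) are made up for this example.

```java
import java.net.URI;

class UrlParts {
    // Splits a URL into the components of
    // [scheme:][//authority][path][?query][#fragment].
    static String describe(String url) {
        URI u = URI.create(url);
        return "scheme=" + u.getScheme()
             + " authority=" + u.getAuthority()
             + " path=" + u.getPath()
             + " query=" + u.getQuery()
             + " fragment=" + u.getFragment();
    }

    // Resolves a relative URL (possibly with a query string) against a base URL.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Illustrative URL combining the pieces shown on the slide.
        System.out.println(describe("http://www.w3.org:80/TR/html4/?id=12#minitoc"));
        System.out.println(resolve("http://www.w3.org/TR/html4/",
                                   "employee/emp?id=12&name=jack"));
    }
}
```

A component that is absent from the URL is reported as null by the getters.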

Transparency No. 9

URIs, URNs, and IRIs

URI = URL + URN.
- A means to describe all resources in the information space, even those that do not have a physical presence.
- Can use ASCII characters only.
- References: http://www.w3.org/Addressing/ and http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
- Java API for URIs: java.net.URI

Uniform Resource Identifier (URI) (RFC 2396, RFC 3986)
- General form: scheme:scheme-specific-part
- Conventions about the use of /, #, and ?
- Common schemes: http, https, ftp, mailto, file
- Use %hh to escape special characters, e.g. # is written %23.

Uniform Resource Name (URN) (RFC 2141)
- A pointer to a resource, but without a reference to its location.
- Examples: urn:isbn:0-471-94128-X urn:jdbc:...

Internationalized Resource Identifier (IRI) (RFC 3987)
- Allows non-ASCII characters; mapped to a URI by an encoding function (Punycode for the host name, percent-encoded UTF-8 bytes for the rest).
- Example:

    http://www.blåbærgrød.dk/blåbærgrød.html -->
    http://www.xn--blbrgrd-fxak7p.dk/bl%C3%A5b%C3%A6rgr%C3%B8d.html
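
The IRI-to-URI mapping and the %hh escaping can be tried out with the standard library. This is only a sketch: the class name IriToUri and the host example.org are hypothetical, and it relies on java.net.IDN for the host name and on java.net.URI.toASCIIString(), which percent-encodes non-ASCII characters from their UTF-8 bytes.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

class IriToUri {
    // Maps an internationalized host and path to an ASCII-only URI string.
    static String toAsciiUri(String scheme, String host, String path) {
        try {
            String asciiHost = IDN.toASCII(host);          // Punycode for the host
            URI uri = new URI(scheme, asciiHost, path, null);
            return uri.toASCIIString();                    // %hh escapes for the rest
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "blåbærgrød" written with Unicode escapes to keep the source ASCII-only.
        String name = "bl\u00e5b\u00e6rgr\u00f8d";
        System.out.println(toAsciiUri("http", "www." + name + ".dk", "/" + name + ".html"));
        // A '#' inside a path is a special character and is escaped as %23.
        System.out.println(toAsciiUri("http", "example.org", "/a#b"));
    }
}
```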

Transparency No. 10

Brief Introduction to HTML

Overall structure of an HTML document:

    <html>
      <head>
        <title>The Title of the Document</title>
        <!-- <meta /> <script /> <style /> <link /> -->
      </head>
      <body bgcolor="white">
        ...
      </body>
    </html>

Transparency No. 11

Simple Formatting (1/2)

    <html>
      <head>
        <title>Good Advice</title>
      </head>
      <body>
        <h1>Good Advice for Everyday Life</h1>
        <h2>For UNIX programmers</h2>
        <b>Never</b> type:
        <p><tt>rm -rf /*</tt><p>
        on your computer.
        <h2>For Nuclear Scientists</h2>
        <b>Never</b> press the <i>Big <font color="red">Red</font> Button</i>.
      </body>
    </html>

Common formatting tags:
- h1 ~ h6: headings
- b: bold; i: italics; tt: fixed-width font
- br: line break
- font: font style designation
- pre: preformatted text
- ul: unordered list; ol: ordered list; li: list item

Transparency No. 12

Simple Formatting (2/2)

Transparency No. 13

More Formatting

    <html>
      <head>
        <title>Things To Do</title>
      </head>
      <body>
        <ol>
          <li>Feed the cat.
          <li>Try out the shell command:
    <pre>foreach x ( `ls` )
      cat $x | tr "aeiouy" "x" > $x
    end</pre>
          <li>Buy ticket for Timbuktu.
        </ol>
      </body>
    </html>

Transparency No. 14

Hyperlinks: Source Document

    <html>
      <head>
        <title>Source Document</title>
      </head>
      <body>
        <a href="target.html#danger">Better look here</a>.
      </body>
    </html>

Transparency No. 15

Hyperlinks: Target Document

    <html>
      <head>
        <title>Target Document</title>
      </head>
      <body>
        ...
        <a name="danger"></a>
        <h2>Chapter 17: Dangerous Shell Commands</h2>
        Never execute a shell command that inadvertently changes
        all vowels to the character 'x'.
      </body>
    </html>

Transparency No. 16

Tables

    <table border="1">
      <tr> <td>PostScript</td> <td align="right">11,274 bytes</td> </tr>
      <tr> <td>PDF</td>        <td align="right">4,915 bytes</td> </tr>
      <tr> <td>MS Word</td>    <td align="right">19,456 bytes</td> </tr>
      <tr> <td>HTML</td>       <td align="right">28 bytes</td> </tr>
    </table>

Transparency No. 17

Tables for Alignment

    <table width="100%">
      <tr>
        <td align="left">
          <a href="index.html"><img src="home.gif" border="0"></a>
          <a href="info.html"><img src="info.gif" border="0"></a>
        </td>
        <td align="right">
          <a href="links.html"><img src="left.gif" border="0"></a>
          <a href="survivor.html"><img src="right.gif" border="0"></a>
        </td>
      </tr>
    </table>
    <h1>Using Tables</h1>

Transparency No. 18

Fill-Out Forms

- Collect named values from the client
- Allow clients to invoke remote services

    <form method="get" action="http://www.google.com/search">
      <input type="text" name="q">
      <input type="submit" name="btnG" value="Google Search">
    </form>
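
To make "named values" concrete: when a form with method="get" is submitted, the browser appends the field names and values to the action URL as a query string. The sketch below mimics that step. The class FormQuery and the sample value "hello world" are assumptions for illustration; URLEncoder.encode(String, Charset) requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormQuery {
    // Builds the URL a GET form submission would request:
    // action?name1=value1&name2=value2 with form-urlencoded names and values.
    static String buildGetUrl(String action, Map<String, String> fields) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return action + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("q", "hello world");        // text typed by the user (made up)
        fields.put("btnG", "Google Search");   // value of the submit button
        System.out.println(buildGetUrl("http://www.google.com/search", fields));
    }
}
```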

Transparency No. 19

GUI Elements

    <input name="foo" type="text" size="20">
    <hr>
    <input name="bar" type="radio" value="s">Small
    <input name="bar" type="radio" value="m">Medium
    <input name="bar" type="radio" value="l">Large
    <hr>
    <input name="baz" type="checkbox" value="c">Cheese
    <input name="baz" type="checkbox" value="p">Pepperoni
    <input name="baz" type="checkbox" value="a">Anchovies
    <hr>
    <select name="bar">
      <option value="s">Small
      <option value="m">Medium
      <option value="l">Large
    </select>
    <hr>
    <select name="baz" multiple>
      <option value="c">Cheese
      <option value="p">Pepperoni
      <option value="a">Anchovies
    </select>

Types of input: text, password, radio, checkbox, button, hidden, file, reset, image, submit.

See also: new input types in HTML5.

Transparency No. 20

GUI Elements (continued)

    <hr>
    <textarea name="foo" rows="5" cols="20">Write something here...</textarea>
    <hr>
    <input name="foo" type="password" value="tomato">
    <hr>
    <input name="foo" type="file">
    <hr>
    <input name="foo" type="hidden" value="you can't see this">
    <hr>
    <input name="qux" type="image" src="Denmark.gif">
    <hr>
    <input type="submit" value="Submit this form">
    <hr>
    <input type="reset" value="Reset this form">

Transparency No. 21

Logical Versus Physical

(Logical) structure
• the page starts with a header
• the entries are written in a list
• numbers are emphasized

(Physical) layout and rendering
• headers are centered, huge, and grey
• lists have square bullets
• emphasis is rendered in bold-style italics

● To promote the reuse and unification of document styles, it is desirable to separate the physical layout and presentation information of a document from its structure.

Transparency No. 22

Brief Introduction to CSS

- Cascading Style Sheets separate structure from layout
- The essential concepts are selectors and properties
- Properties may have different values:

color         red, yellow, rgb(212,120,20)
font-style    normal, italic, oblique
font-size     12pt, larger, 150%, 1.5em
text-align    left, right, center, justify
line-height   normal, 1.2em, 120%
display       block, inline, list-item, none (+)

(+) display: none hides the content, and it then takes up no space.
1em = the font size of the element (traditionally the width of "M"); 1ex = the height of a lowercase "x".

Transparency No. 23

Structure of a Stylesheet

- A selector is a list of tag names
- For each selector, some properties are assigned values:

    b {color: red; font-size: 12pt}
    i {color: green}

- Longer selectors give context sensitivity:

    table b {color: red; font-size: 12pt}
    form b {color: yellow; font-size: 12pt}
    i {color: green}

- The most specific selector is chosen to apply

Transparency No. 24

Specificity in Action

    <head>
      <style type="text/css">
        b {color: red;}
        b b {color: blue;}
        b.foo {color: green;}
        b b.foo {color: yellow;}
        b.bar {color: maroon;}
      </style>
      <title>CSS Test</title>
    </head>
    <body>
      <b class=foo>Hey!</b>
      <b>Wow!
        <b>Amazing!</b>
        <b class=foo>Impressive!</b>
        <b class=bar>k00l!</b>
        <i>Fantastic!</i>
      </b>
    </body>

Rendered result: Hey! (green) Wow! (red) Amazing! (blue) Impressive! (yellow) k00l! (maroon) Fantastic! (red, inherited from the enclosing b)
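
The rule "the most specific selector wins" can be sketched for the restricted selectors used in this example (tag names, .class suffixes, and descendant combinators only). The class Specificity below is a hypothetical helper, not part of any CSS library; it ignores ids, attributes, and source order, which real CSS also takes into account.

```java
class Specificity {
    // Returns {classCount, tagCount} for a selector such as "b b.foo".
    // Assumes only tag names, ".class" suffixes and spaces between parts.
    static int[] of(String selector) {
        int classes = 0;
        int tags = 0;
        for (String part : selector.trim().split("\\s+")) {
            String[] pieces = part.split("\\.", -1);
            if (!pieces[0].isEmpty()) {
                tags++;                    // a leading tag name such as "b"
            }
            classes += pieces.length - 1;  // one class per '.' in the part
        }
        return new int[] {classes, tags};
    }

    // Positive if selector a is more specific than selector b.
    // Classes outweigh any number of tag names.
    static int compare(String a, String b) {
        int[] x = of(a);
        int[] y = of(b);
        return x[0] != y[0] ? Integer.compare(x[0], y[0])
                            : Integer.compare(x[1], y[1]);
    }

    public static void main(String[] args) {
        // "k00l!" matches both "b b" and "b.bar"; the class selector wins.
        System.out.println(compare("b.bar", "b b") > 0);
    }
}
```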

Transparency No. 25

Applying a Stylesheet

The stylesheet style.css:

    h1 { color: #888; font: 50px/50px "Impact"; text-align: center; }
    ul { list-style-type: square; }
    em { font-style: italic; font-weight: bold; }

The document that uses it:

    <html>
      <head>
        <title>Phone Numbers</title>
        <link href="style.css" rel="stylesheet" type="text/css">
      </head>
      <body>
        <h1>Phone Numbers</h1>
        <ul>
          <li>John Doe, <em>(202) 555-1414</em>
          <li>Jane Dow, <em>(202) 555-9132</em>
          <li>Jack Doe, <em>(212) 555-1742</em>
        </ul>
      </body>
    </html>

Transparency No. 26

HTML Validity

- HTML has a formal syntax specification: 800 lines of DTD notation
- A validator gives syntax errors for invalid documents
- Valid documents may display a validation logo
- Available validator: a unified validating service (Unicorn)
- Most HTML documents on the Web are invalid (data taken from the textbook, 2006):

www.microsoft.com   123 errors
www.cnn.com          58 errors
www.ibm.com          30 errors
www.google.com       27 errors
www.sun.com          19 errors

Transparency No. 27

Validation Errors

Input document:

    <html>
      <body>
        <table><b>123</i></table>
      </body>
    </html>

Validator output (line 3 is <body>, line 4 is <table><b>123</i></table>, line 6 is </html>):

    Line 3, column 7: document type does not allow element "BODY" here.
    Line 4, column 13: document type does not allow element "B" here; assuming missing "CAPTION" start-tag
    Line 4, column 20: end tag for element "I" which is not open.
    Line 4, column 28: end tag for "B" omitted, but its declaration does not permit this.
    Line 4, column 11: start tag was here.
    Line 4, column 28: end tag for "CAPTION" omitted, but its declaration does not permit this.
    Line 4, column 11: start tag was here.
    Line 4, column 28: end tag for "TABLE" which is not finished.
    Line 6, column 6: end tag for "HTML" which is not finished.

Transparency No. 28

Reasons for Invalidity

- Ignorance of the HTML standard
- Lack of testing:
  - "This page is optimized for the XYZ browser"
  - "This page is best viewed in 1024x768"
- Automatic tools generate invalid HTML output
- Forgiving browsers try to interpret invalid input:

    <h2>Lousy HTML</h1>
    <li><a>This is not very</b> good.
    <li><i>In fact, it is quite bad</em>
    </ul>
    But the browser does <a naem="goof">something.

Transparency No. 29

Problems with Invalidity

- There are several different browsers
- Each browser has many different implementations
- Each implementation must interpret invalid HTML
- There are many arbitrary choices to make

Consequences:
- The HTML standard has been undermined
- HTML renders differently for most clients

Transparency No. 30

A Standard for Invalid HTML

- The HTML Tidy tool tries to save the situation
- Invalid HTML is transformed to (almost) valid HTML
- Still many arbitrary choices, but now we agree

Input:

    <h2>Lousy HTML</h1>
    <li><a>This is not very</b> good.
    <li><i>In fact, it is quite bad</em>
    </ul>
    But the browser does <a naem="goof">something.

Output of HTML Tidy:

    <html>
      <head><title></title></head>
      <body>
        <h2>Lousy HTML</h2>
        <ul class="noindent">
          <li><a>This is not very good.</a></li>
          <li><i>In fact, it is quite bad</i></li>
        </ul>
        But the browser does <a naem="goof">something.</a>
      </body>
    </html>

Transparency No. 31

HTML for Recipes

    <h1>Rhubarb Cobbler</h1>
    <h2>Wed, 4 Jun 95</h2>
    This recipe is suggested by Jane Dow.
    Rhubarb Cobbler made with bananas as the main sweetener.
    It was delicious.
    <table>
      <tr><td> 2 1/2 cups     <td> diced rhubarb
      <tr><td> 2 tablespoons  <td> sugar
      <tr><td> 2              <td> fairly ripe bananas
      <tr><td> 1/4 teaspoon   <td> cinnamon
      <tr><td> dash of        <td> nutmeg
    </table>
    <i>Combine all and use as cobbler, pie, or crisp.</i>
    <p>This recipe has 170 calories, 28% from fat,
    58% from carbohydrates, and 14% from protein.
    <p>Related recipes: <a href="#GardenQuiche">Garden Quiche</a>
    is also yummy.

Transparency No. 32

Limitations of HTML

- HTML is designed for hypertext, not for recipes
- Structure and presentation are intertwined
- HTML validation is less than recipe validation
- HTML standards have been undermined

We need a special Recipe Markup Language!

Transparency No. 33

Unicode : Bytes vs. Characters

- HTML files are represented as text files
- A text file is logically a sequence of characters, but physically a sequence of bytes
- Several mappings exist: ASCII, EBCDIC, Unicode
- Unicode aims to cover all characters in all past or present written languages

Transparency No. 34

Unicode Characters

- A character is a symbol that appears in a text:
  - letters of the alphabet
  - pictograms (like ©)
  - modifiers (such as accents): å
- Unicode characters are abstract entities described by names such as:
  - LATIN CAPITAL LETTER A
  - LATIN CAPITAL LETTER A WITH RING ABOVE
  - HIRAGANA LETTER SA (さ)
  - RUNIC LETTER THURISAZ THURS THORN
- By themselves they have neither a graphical nor a byte representation.

Transparency No. 35

Unicode Glyphs

- A glyph is a graphical presentation of a part of a text.
- A typical example is the glyph Å
- This glyph may represent either of the characters:
  - LATIN CAPITAL LETTER A WITH RING ABOVE
  - ANGSTROM SIGN
- Or even a sequence of two characters: LATIN CAPITAL LETTER A + COMBINING RING ABOVE
- There are also cases in which a single character maps to two or more glyphs

Conclusion: the mapping between characters and glyphs is not one-to-one.

Transparency No. 36

Unicode Code Points

- A code point is a unique number assigned to every Unicode character
- Code points range from 0 to 1,114,111 (0x10FFFF), i.e. 1,114,112 values, organized into 17 planes; 1 plane = 256 pages; 1 page = 256 code points
- Only around 100,000 are used today
- The character HIRAGANA LETTER SA (さ) is assigned the code point 12,373
- Code points 0 through 127 coincide with ASCII
- Some code points are never assigned to characters: those of the form 11011--- -------- (0xD800 to 0xDFFF) are reserved for the UTF-16 encoding
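
The numbers on this slide can be checked against the Java standard library. The class name CodePointFacts is made up for this sketch; the Character methods it calls are standard.

```java
class CodePointFacts {
    // 17 planes x 256 pages x 256 code points per page.
    static int totalCodePoints() {
        return 17 * 256 * 256;
    }

    // The plane a code point belongs to (0 is the Basic Multilingual Plane).
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());                   // number of code points
        System.out.println(Character.MAX_CODE_POINT);            // largest code point
        System.out.println("\u3055".codePointAt(0));             // HIRAGANA LETTER SA
        System.out.println(Character.getName(0x3055));           // its Unicode name
        System.out.println(Character.isSurrogate((char) 0xD800)); // reserved for UTF-16
    }
}
```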

Transparency No. 37

Unicode Character Encoding

- A character encoding is used to interpret a sequence of bytes as a sequence of code points
- The bytes are first parsed into code units
- Code units have a fixed length
- One or more code units may be required to denote a code point
- Examples are UTF-8, UTF-16, UTF-32

Transparency No. 38

UTF-8

- A code unit is a single byte
- A code point takes from 1 to 4 code units
- Code units between 0 and 127 directly represent the corresponding code points: 00000000 ~ 01111111 represent themselves
- 110XXXXX indicates that 2 code units are used
- 1110XXXX indicates that 3 code units are used
- 11110XXX indicates that 4 code units are used
- The remaining code units look like 10XXXXXX

Possible forms of a UTF-8 encoded code point:
1. 0xxxxxxx
2. 110xxxxx 10xxxxxx
3. 1110xxxx 10xxxxxx 10xxxxxx
4. 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Transparency No. 39

UTF-8 Example

Given the sequence: 11100011 10000001 10010101

The first byte (1110xxxx) indicates that the first code point needs 3 bytes: 11100011 10000001 10010101

After extracting all content bits (4 + 6 + 6): 0011 0000 0101 0101

We obtain a two-byte code point, 0x3055 or 12,373, which is the character HIRAGANA LETTER SA.
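
The same decoding steps can be written out in a few lines. This is a minimal sketch that assumes well-formed input (it does not check that continuation bytes start with 10); the class name Utf8Decode is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8Decode {
    // Decodes the first code point of a well-formed UTF-8 byte sequence.
    static int firstCodePoint(byte[] bytes) {
        int b0 = bytes[0] & 0xFF;
        int length;
        int cp;
        if (b0 < 0x80) {                    // 0xxxxxxx
            length = 1; cp = b0;
        } else if ((b0 & 0xE0) == 0xC0) {   // 110xxxxx
            length = 2; cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) {   // 1110xxxx
            length = 3; cp = b0 & 0x0F;
        } else if ((b0 & 0xF8) == 0xF0) {   // 11110xxx
            length = 4; cp = b0 & 0x07;
        } else {
            throw new IllegalArgumentException("not a UTF-8 lead byte");
        }
        for (int i = 1; i < length; i++) {
            cp = (cp << 6) | (bytes[i] & 0x3F);   // append 6 content bits of 10xxxxxx
        }
        return cp;
    }

    public static void main(String[] args) {
        byte[] sa = {(byte) 0xE3, (byte) 0x81, (byte) 0x95};   // the bytes from the slide
        System.out.println(Integer.toHexString(firstCodePoint(sa)));
        // Cross-check against the built-in decoder.
        System.out.println(new String(sa, StandardCharsets.UTF_8).codePointAt(0));
    }
}
```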

Transparency No. 40

UTF-16

- A code unit consists of 2 bytes
- Code points below 65,536 fit in a single code unit
- Higher code points y are represented as two code units:

    110110XX XXXXXXXX 110111XX XXXXXXXX

  where the 20 bits XX...X are the binary representation of y - 2^16, i.e. y = (XX...X)_2 + 2^16
- They are rarely used!
- But then does 11011000 00000000 represent the code point 0xD800, or one half of a code point > 65,535?
- This is resolved because Unicode assigns no code points between
    11011000 00000000 (55,296; 0xD800) and
    11011111 11111111 (57,343; 0xDFFF)
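
The two-code-unit form can be computed directly from the formula y - 2^16. The sketch below uses a hypothetical class Utf16Surrogates and, as a sample, code point 0x1F600 (an arbitrary choice above 65,535); the result is cross-checked with the standard Character.toChars.

```java
import java.util.Arrays;

class Utf16Surrogates {
    // Encodes a code point y as one or two UTF-16 code units.
    static char[] encode(int y) {
        if (y < 0x10000) {
            return new char[] {(char) y};           // fits in a single code unit
        }
        int x = y - 0x10000;                        // 20-bit value y - 2^16
        char high = (char) (0xD800 | (x >>> 10));   // 110110XX XXXXXXXX
        char low = (char) (0xDC00 | (x & 0x3FF));   // 110111XX XXXXXXXX
        return new char[] {high, low};
    }

    // Recovers y from the two code units.
    static int decode(char high, char low) {
        return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000;
    }

    public static void main(String[] args) {
        char[] units = encode(0x1F600);
        System.out.println(Integer.toHexString(units[0]) + " " + Integer.toHexString(units[1]));
        System.out.println(Arrays.equals(units, Character.toChars(0x1F600)));
    }
}
```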

Transparency No. 41

Byte Order

- When reading several bytes at once, we must consider the byte order of the architecture
- UTF-16 text normally starts with the special code point 11111110 11111111 (65,279; 0xFEFF), called zero-width non-breaking space and used as the byte order mark (BOM)
- The byte-swapped value 11111111 11111110 (65,534; 0xFFFE) is never assigned (i.e., it is not a legal character), so a reader can tell which byte order was used
- UTF-16LE (least significant byte first) and UTF-16BE (most significant byte first) fix the byte order in advance and so avoid the need for a BOM

Transparency No. 42

UTF-16 Example

Big-endian: the BOM arrives as 11111110 11111111 (0xFE 0xFF)

    11111110 11111111 00110000 01010101
    code unit 00110000 01010101 = 0x3055 = 12,373, HIRAGANA LETTER SA

Little-endian: the BOM arrives as 11111111 11111110 (0xFF 0xFE)

    11111111 11111110 00110000 01010101
    the bytes of each code unit are swapped: 01010101 00110000 = 0x5530 != 12,373
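
A small experiment shows the effect of the BOM. This sketch (class name Utf16ByteOrder is made up) relies on the standard charsets: UTF_16 reads a leading BOM to pick the byte order, while UTF_16BE and UTF_16LE assume a fixed order.

```java
import java.nio.charset.StandardCharsets;

class Utf16ByteOrder {
    // Decodes bytes that start with a BOM; the BOM selects the byte order.
    static String decodeWithBom(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x30, 0x55};
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x30, 0x55};
        // BOM FE FF: 30 55 is read as 0x3055.
        System.out.println(Integer.toHexString(decodeWithBom(bigEndian).charAt(0)));
        // BOM FF FE: 30 55 is read as 0x5530.
        System.out.println(Integer.toHexString(decodeWithBom(littleEndian).charAt(0)));
        // Without a BOM the caller has to name the byte order.
        byte[] noBom = {0x30, 0x55};
        System.out.println(Integer.toHexString(
                new String(noBom, StandardCharsets.UTF_16BE).charAt(0)));
    }
}
```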

Transparency No. 43

ISO-8859-1

- Another popular character encoding
- Only 256 code points
- Single-byte code units
- Coincides with ASCII on code points 0-127
- Cannot represent general Unicode

In all, there are hundreds of different encodings: ISO-8859-1 ~ ISO-8859-9, big5, MS950, ...
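
The limits of a single-byte encoding show up as soon as a character outside its 256 code points is encoded. In this sketch (class name Latin1Demo is hypothetical), String.getBytes substitutes the charset's replacement byte, '?', for characters that ISO-8859-1 cannot represent.

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    // Encodes a string in ISO-8859-1; unmappable characters become '?'.
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E5 (a with ring above) is one byte in ISO-8859-1, two in UTF-8.
        System.out.println(latin1("\u00e5").length);
        System.out.println("\u00e5".getBytes(StandardCharsets.UTF_8).length);
        // HIRAGANA LETTER SA is outside ISO-8859-1 and is replaced by '?'.
        System.out.println((char) latin1("\u3055")[0]);
    }
}
```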

Transparency No. 44

Character Encodings in HTML

- The document may declare its own encoding:

    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

- For the meta tag to be parsed correctly, the encoding should coincide with ASCII.
- Unicode characters may be represented as character references of the form:

    &#12373;   (decimal)
    &#x3055;   (hexadecimal)
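
Character references are easy to generate from code points. The helper class CharRefs below is an assumption made for illustration; it only handles non-ASCII characters and does not escape markup characters such as < or &.

```java
class CharRefs {
    // Decimal character reference, e.g. &#12373;
    static String decimalRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Hexadecimal character reference, e.g. &#x3055;
    static String hexRef(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint) + ";";
    }

    // Replaces every non-ASCII code point by a decimal character reference,
    // so the result can be stored in any ASCII-compatible encoding.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(decimalRef(cp));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decimalRef(0x3055));
        System.out.println(hexRef(0x3055));
        System.out.println(escapeNonAscii("sa: \u3055"));
    }
}
```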

Transparency No. 45

Unicode in Java

- Java represents characters as UTF-16 code units, not as Unicode code points!
- A pragmatic choice to use only 16 bits
- The length function on strings may be wrong: it returns the number of code units
- Some strings may represent illegal data, e.g.: char c = 0xFFFE; // not a legal Unicode code point
- Java uses the classes java.io.InputStream/OutputStream to represent byte streams, and java.io.Reader/Writer to represent character streams.
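
The difference between code units and code points is visible with any character above 65,535. The class name StringLengths and the sample code point 0x1F600 are choices made for this sketch; length() and codePointCount() are standard String methods.

```java
class StringLengths {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points actually contained in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by one code point that needs two code units.
        String s = "a" + new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(s));
        System.out.println(codePoints(s));
    }
}
```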

Transparency No. 46

Different ways to get an InputStream

There are many ways to get an InputStream, depending on the input source:
- From a file: new FileInputStream(file) or new FileInputStream("FileName")
- From a byte[]: new ByteArrayInputStream(bytes)
- From a URL connection: new URL("http://...").openStream(), or con = new URL("...").openConnection(); con.connect(); con.getInputStream()
- From a java.net.Socket or ServerSocket:
  - new Socket("remoteHost", port).getInputStream() (or getOutputStream())
  - new ServerSocket(port).accept().getInputStream() (or getOutputStream())

Transparency No. 47

From Bytes to Characters in Java

InputStreams let you read data as bytes, but how do you read characters from an input stream?

Answer: use a byte-to-char transformer, InputStreamReader:

    InputStream in = ...;
    // the second argument may be a charset name, a Charset, or a CharsetDecoder
    Reader rd = new InputStreamReader(in, "UTF-8");
    int ch = rd.read();
    rd.read(new char[100], 0, 50);

    BufferedReader brd = new BufferedReader(rd);
    String s;
    while ((s = brd.readLine()) != null) out.println(s);
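
A self-contained version of this pattern, reading from an in-memory byte array so that it can be run without any file or network, might look as follows. The class and method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ReadLines {
    // Reads all lines from a byte stream, decoding the bytes as UTF-8.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader brd = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String s;
            while ((s = brd.readLine()) != null) {
                lines.add(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two lines; the second is HIRAGANA LETTER SA, stored as 3 bytes.
        byte[] data = "first line\n\u3055".getBytes(StandardCharsets.UTF_8);
        for (String line : readLines(new ByteArrayInputStream(data))) {
            System.out.println(line.length() + " chars: " + line);
        }
    }
}
```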

Transparency No. 48

From Characters to Bytes in Java

Similarly, character data can be written to a Writer, which is then transformed into bytes via an OutputStreamWriter.

Example: write the String s = "政大資科系" as bytes in UTF-8 format to the file "f1.txt".

    OutputStream ops = new FileOutputStream("f1.txt");
    // or ops = new Socket(...).getOutputStream();
    // or ops = new ByteArrayOutputStream();
    BufferedWriter wr = new BufferedWriter(
        // the second argument may be a charset name, a Charset, or a CharsetEncoder
        new OutputStreamWriter(ops, "UTF-8"));
    wr.write(s, 0, s.length());
    wr.close();
    // cf. new FileWriter("f1.txt").write(s, 0, s.length()), which uses the default charset
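
The same idea can be run without touching the file system by writing into a ByteArrayOutputStream, one of the alternatives listed in the comments. The class name WriteBytes is hypothetical; the string is the slide's example written with Unicode escapes.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriteBytes {
    // Writes the characters of s through a Writer and returns the bytes
    // produced by the given charset.
    static byte[] encode(String s, Charset charset) {
        ByteArrayOutputStream ops = new ByteArrayOutputStream();
        try (Writer wr = new BufferedWriter(new OutputStreamWriter(ops, charset))) {
            wr.write(s, 0, s.length());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ops.toByteArray();   // complete, because the writer has been closed
    }

    public static void main(String[] args) {
        // The five characters of the slide's example string.
        String s = "\u653f\u5927\u8cc7\u79d1\u7cfb";
        System.out.println(encode(s, StandardCharsets.UTF_8).length + " bytes in UTF-8");
        System.out.println(encode(s, StandardCharsets.UTF_16BE).length + " bytes in UTF-16BE");
    }
}
```

Closing the writer before calling toByteArray() matters: a BufferedWriter may hold characters back until it is flushed.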

Transparency No. 49

World Wide Web Consortium (W3C)

- Web site: http://www.w3.org
- Develops HTML, CSS, and most Web technology
- Founded in 1994
- Has hundreds of companies and organizations as members
- Is directed by Tim Berners-Lee
- Located at MIT (US), Inria (France), Keio University (Japan)

Transparency No. 50

W3C Players

- Members
- Team
- Advisory board
- Technical Architecture Group
- Working Groups

Transparency No. 51

W3C Documents

- Working Drafts
- Candidate Recommendations
- Proposed Recommendations
- Recommendations

- Working Group Notes
- Member Submissions
- Staff Comments
- Team Submissions

Transparency No. 52

W3C Principles

- Consensus among members
- Limited intellectual property rights
- Free Web access to technical reports (unlike ISO)

Transparency No. 53

Summary

- History and structure of HTML and CSS
- Survivor's guides to these technologies
- Limitations of HTML for general data

Transparency No. 54

Essential Online Resources

- http://www.w3.org/TR/html4/
- http://www.w3.org/Addressing/
- http://www.w3.org/Style/CSS/
- http://validator.w3.org/
- http://www.w3.org/

