Lecture 6: The World Wide Web and HTML

HG2052 Language, Technology and the Internet

The World Wide Web and HTML

Francis Bond
Division of Linguistics and Multilingual Studies

http://www3.ntu.edu.sg/home/fcbond/

Lecture 6

HG2052 (2020)

Revision of Collaboration and Wikis

• Version Control Systems
• Wikipedia
• Licensing and Ownership

Version Control Systems

• Versioning file systems
  ◦ every time a file is opened, a new copy is stored
• CVS, Subversion, Git
  ◦ changes to a collection of files are tracked
  ◦ simultaneous changes are merged
• Revision Tracking
  ◦ Revisions are stored within a file
• Authorship in shared writing

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet. (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge: people like to do this
• Most edits are done by insiders!
• Most content is added by outsiders!
• Content comparable to Britannica

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view.

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Source: Wikipedia:Five pillars

Licenses and Ownership

• Copyright
• Copyleft
• Creative Commons

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Source: Wikipedia:Good_article_criteria

The World Wide Web and HTML

Overview

• The Internet
• The structure of Markup
• The structure of the Web
• The future of the Web
• Linguistic features of the web

The Internet

• A global system of interconnected computer networks that use the standard Internet Protocol Suite (TCP/IP)
• Carries several services
  ◦ HTTP (Hyper Text Transfer Protocol): the Web
  ◦ Email
  ◦ VoIP (Voice over IP): telephony, Skype
  ◦ FTP, … (File Transfer)
  ◦ Streaming Media: music, video
  ◦ Instant Messaging

Map of online communities (2007)

http://xkcd.com/256/

Map of online communities (2010)

http://xkcd.com/802/

Growth of the Internet

[Figure: chart of Internet users per 100 inhabitants, 1996 to 2014; vertical axis from 0 to 80; values marked * are estimates]

https://commons.wikimedia.org/wiki/File:Internet_users_per_100_inhabitants_ITU.svg

Markup: formatting information

Why Markup?

• Reduce Ambiguity
  ◦ Need to make meaning explicit
• Traditionally this is done by annotating text in some way

Markup Languages

• Annotation on how to print is called markup
  ◦ underlining to indicate boldface
  ◦ special symbols for passages to be omitted
  ◦ special symbols for passages to be printed in a particular font
• This existed before computers
  ◦ Editors would mark up hand-written manuscripts
  ◦ … and pass them to typesetters
  ◦ … who would prepare the manuscript for printing

Printers’ Markup

Early Computer Markup (troff)

Headline
and some text

.ps 12    \" point size 12
.ft B     \" font type Bold
Headline
.ps 10    \" point size 10
.ft R     \" font type Roman
and some text.

• Marked up with troff
• PostScript and PDF (Portable Document Format) are similar

Visual Markup vs Logical Markup

• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers’ markup
  ◦ Shows what things look like
• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)

Standard Generalized Markup Language: SGML

• ISO standard based on IBM’s GML
• Attempt to make markup independent of processor
  ◦ Important for archiving information
• Emphasis on logical markup
• Popularized the use of <tag></tag> notation
  ◦ and entities (&lt; and &gt;) when you need a literal < or >
• Split the document into: Declaration, Prolog, Document instance
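The entity notation above can be tried out directly. This is a minimal Python sketch using the standard library's html module; the sample string is made up for illustration and is not from the lecture.

```python
import html

# A piece of text that itself talks about tags (illustrative sample).
raw = "Use <tag></tag> notation"

# Replace the markup-significant characters with entities so the text
# can be embedded in a document without being read as markup.
escaped = html.escape(raw)
print(escaped)

# Unescaping turns the entities back into the literal characters.
restored = html.unescape(escaped)
print(restored)
```

The escaped form is what you would write in the source when you want a reader to see the angle brackets rather than have them interpreted as a tag.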

Hyper Text Markup Language: HTML

• Markup Language for web pages
• An application of SGML
• Combines logical and visual markup
• Also allows hyperlinks (linking and anchoring)
• Created by Tim Berners-Lee at CERN (1989)
  ◦ to make physics papers and documentation more accessible

HTML example

Headline
and some text

• Logical

<h1>Headline</h1>
<p>and some text

• Visual

<font size="3"><b>Headline</b></font><br>
and some text

Logical allows various styles

Headline
and some text

<style>
H1 {
  font-size: 24px;
  color: blue;
  margin-top: 10px;
  margin-bottom: 15px;
}
</style>

• This can be done using CSS (Cascading Style Sheets)
• Separate Logical and Visual Structure

Benefits of Logical Tags

• Can transform things easily
  ◦ No bold for Japanese and Chinese (just use size)
  ◦ Can adapt to other modalities (speech)
• Logical form useful for other tasks
  ◦ Summarization
    ∗ Just show <h1> … <h3>
  ◦ Translation
    ∗ Headers are noun phrases, not sentences
• Robustness: you can read the source directly
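The summarization idea (just show the headings) can be sketched in a few lines of Python. This uses the standard library's HTMLParser; the class name HeadingOutline, the function heading_outline and the sample page are all hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect the text of <h1>, <h2> and <h3> elements in document order."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.outline = []   # list of (level, heading text) pairs
        self._level = None  # level of the heading being read, if any
        self._text = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self._level = self.LEVELS[tag]
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def heading_outline(source):
    """Return the (level, text) pairs for all headings in an HTML string."""
    parser = HeadingOutline()
    parser.feed(source)
    parser.close()
    return parser.outline


# Illustrative page: only the logical heading tags are used for the summary.
sample = """
<h1>The World Wide Web</h1>
<p>Some introductory text.</p>
<h2>Markup</h2>
<p>More text.</p>
<h3>Logical markup</h3>
<p>Even more text.</p>
"""

for level, text in heading_outline(sample):
    print("  " * (level - 1) + text)
```

Nothing comparable is possible with purely visual markup: a bold, large-font line may or may not be a heading, so a program cannot reliably pick out the structure.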

But still there is ambiguity!

• Tags on one site may not mean the same thing on another site
• Huge amount of information
  ◦ Looking for Eric Miller may get the wrong one!
  ◦ Looking for NTU gets
    ∗ Nanyang Technological University
    ∗ National Taxpayers Union
    ∗ National Taiwan University
• What can we do? Semantic Web (week 10)

Hypertext

• HTML crucially adds hyperlinks
  ◦ these extend text in a new way
  ◦ references that you can immediately access
• <a href="http://somewhere.on.the.web">link me</a>
• <img src="http://somewhere.on.the.web/pic.jpg">
• Immediately accessible references are qualitatively different
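Because hyperlinks are explicit markup, a program can find a page's references as easily as a reader can click them. The following is a minimal Python sketch with the standard library's HTMLParser; the class name LinkCollector and the sample page are assumptions made for illustration, reusing the placeholder addresses from the slide.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect link targets (<a href>) and image sources (<img src>)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])
        elif tag == "img" and "src" in attributes:
            self.images.append(attributes["src"])


# Illustrative page fragment using the slide's placeholder addresses.
page = (
    '<p>See <a href="http://somewhere.on.the.web">link me</a> and '
    '<img src="http://somewhere.on.the.web/pic.jpg"></p>'
)

collector = LinkCollector()
collector.feed(page)
collector.close()

print(collector.links)
print(collector.images)
```

This is the first step of what a Web crawler does: collect the references on a page, then retrieve each of them in turn.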

HTML example

<!doctype html>
<html>
  <head>
    <title>Hello HTML</title>
  </head>
  <body>
    <p>Hello World!</p>
    <p>Oh well, <span lang="fr">c'est la vie</span>,
       as they say in France.</p>
    <abbr id="anId" class="jargon" style="color:blue;"
          title="Hypertext Markup Language">HTML</abbr>
  </body>
</html>

How should you hyperlink?

• Pick a page
  ◦ This course page
  ◦ LMS research page
  ◦ Wiki front page
  ◦ Your choice
• Discuss whether there are too many links, too few, or about the right number. Are they linking to the best targets?
• You may wish to look at the Wikipedia:Manual of Style/Linking <https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking>

Inspired by Crystal (2011, p. 28)

The Structure of the Web

• 550 billion documents on the Web (2001), mostly in the invisible Web, or deep Web
• 11.5 billion indexable web pages (2005)
• 25.21 billion indexable web pages (2009)
• 109.5 million websites (2009)

Source: Wikipedia:World Wide Web

The Deep Web

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (so they cannot be reached by clicking links)

Private Web: sites that require registration and login (Edventure, NTULearn)

Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find!

robots.txt

• A robot (also called a Web crawler or spider) is a program that automatically traverses the Web’s hypertext structure by retrieving a document and recursively retrieving all documents that it references. Robots are used for:
  ◦ Indexing and What’s New monitoring
  ◦ HTML and link validation
  ◦ Mirroring and backup
• A website can explicitly tell robots where they can and cannot go
  ◦ Compliance is voluntary, but followed by most robots
• You can Allow and Disallow whole directories, or individual pages
• You can Allow and Disallow individual user-agents (such as Google)
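To see how Allow and Disallow rules are read by a well-behaved robot, here is a small Python sketch using the standard library's urllib.robotparser. The rules, the example.org addresses and the robot name OtherBot are invented for illustration; a real robot would fetch the site's own robots.txt instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt: one section for a named user-agent,
# one for everybody else ("*").
rules = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /drafts/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the lines directly, without any network access

# Each robot asks, per URL, whether it is allowed to retrieve it.
checks = [
    ("Googlebot", "http://example.org/private/notes.html"),
    ("Googlebot", "http://example.org/index.html"),
    ("OtherBot", "http://example.org/drafts/a.html"),
    ("OtherBot", "http://example.org/private/notes.html"),
]
for agent, url in checks:
    print(agent, url, parser.can_fetch(agent, url))
```

Note that nothing here stops a robot from ignoring the answer: the file only states the site's wishes, which is why compliance is voluntary.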

http://www.robotstxt.org

The Internet and Language Diversity

Distribution of languages among Internet users

From Global Reach (2006) cited in Gerrand (2007)

Internet users by language, February 2005

Source: OECD (2006) cited in Gerrand (2007)

Language of e-commerce, February 2005

Source: OECD (2006) references to secure servers by language, cited in Gerrand (2007)

Percentage of Web sites by language (2014)

[Figure: bar chart of the percentage of Web sites by language; bars for English, Russian, German, Japanese, Spanish, French, Chinese, Portuguese, Italian, Polish, Turkish, Dutch and Others; horizontal axis from 0% to 50%]

https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

Percentage of Web users by language (2014)

[Figure: bar chart of the percentage of Web users by language; bars for English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean and Others; horizontal axis from 0% to 25%]

https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

Gradually Changing

https://www.internetworldstats.com/stats7.htm

The Internet and Language Diversity

• Major languages will survive (not just English)
• Sarnoff’s Law: the value of a broadcast network is proportional to the number of viewers (n)
• Metcalfe’s Law: the value of a telecommunications network is proportional to the square of the number of connected users of the system (n²)
  ⇒ languages with more pages will become even more valuable
• Minor languages probably won’t survive
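The difference between the two laws is easiest to see with numbers. In this Python sketch the function names are hypothetical and the constant of proportionality is assumed to be 1 in both cases, so only the ratios are meaningful, not the absolute values.

```python
def sarnoff_value(n):
    """Broadcast network: value grows linearly with the number of viewers."""
    return n


def metcalfe_value(n):
    """Telecommunications network: value grows with the square of the users."""
    return n ** 2


# Compare what happens when a community doubles in size.
for users in (100, 1_000, 10_000):
    sarnoff_gain = sarnoff_value(2 * users) / sarnoff_value(users)
    metcalfe_gain = metcalfe_value(2 * users) / metcalfe_value(users)
    print(f"{users} -> {2 * users} users: "
          f"Sarnoff x{sarnoff_gain:.0f}, Metcalfe x{metcalfe_gain:.0f}")
```

Doubling the users doubles the value under Sarnoff's Law but quadruples it under Metcalfe's Law, which is the reasoning behind the claim that languages that are already large online gain the most.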

Top ten Wikipedias

See also http://meta.wikimedia.org/wiki/List_of_Wikipedias
Wikipedias in 272 languages: only 96 with more than 10,000 pages

http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

The next 5,000 days of the Web

• Kevin Kelly on the next 5,000 days of the web (20 min)
• http://www.ted.com/talks/lang/eng/kevin_kelly_on_the_next_5_000_days_of_the_web.html
• The impossible has become possible
• The web is a single machine
  ◦ Embodiment
  ◦ Re-structuring
  ◦ Co-dependence

Linguistic features of the web

• Much/most text is just the same
• Un-edited
• Accessible in great volume (and many languages)
• Editable: Wikis, comments, tweets
• Multi-media

Conclusion

• The web is changing what humanity can do with language
• It is not clear if it is changing what individual humans do
• Make sure you go through the Wikipedia tutorial

References

• Crystal, D. (2011). Internet Linguistics: A Student Guide. Routledge.
• Gerrand, P. (2007). Estimating linguistic diversity on the Internet: A taxonomy to avoid pitfalls and paradoxes. Journal of Computer-Mediated Communication, 12(4), article 8. http://jcmc.indiana.edu/vol12/issue4/gerrand.html
• Global Reach. (2006). Global Internet Statistics (by Language). Retrieved October 11, 2006 from http://www.global-reach.biz/globstats/index.php3
